Quiz4 of Computer Architecture (2022 Fall)

--- tags: computer-arch --- # Quiz4 of Computer Architecture (2022 Fall) > Solutions ## Problem `A` Assume that we are building a [JIT compiler](https://en.wikipedia.org/wiki/Just-in-time_compilation), which generates RISC-V instructions at runtime. Because a JIT must render and execute a native binary image at runtime, true machine-code JITs necessitate platforms that allow for data to be executed at runtime -- code is data. Here is a simplified implementation based on C and RV32I: (filename: `jit.c`) ```c= #include <stdint.h> #include <stdio.h> typedef int func_t(void); int main(int argc, char *argv[]) { uint32_t instructions[] = { [0] = A01, /* addi a0, zero, ? */ [1] = 0x00008067, /* A02 */ }; /* Reinterpret the array address as a function */ func_t *jit = (func_t *) instructions; printf("%d\n", jit()); return 0; } ``` Compile via the command below: ```shell $ riscv-none-elf-gcc -O0 -march=rv32i -mabi=ilp32 -o jit.elf jit.c ``` Run it with [rv32emu](https://github.com/sysprog21/rv32emu). ```shell $ rv32emu jit.elf ``` We get the corresponding output. ``` 56 inferior exit code 0 ``` Reference: [Porting a JIT compiler to RISC-V: Challenges and Opportunities](https://hal.archives-ouvertes.fr/hal-03725841/document) 1. Given instruction `addi a0, zero, ?` as shown in Line 8, what is the HEX value of A01? > * A01 = ? > ==0x3800513== > The instruction would be `addi a0, zero, 56` > Opcode addi: 0010011 > destination register a0: 01010 > funct3: 000 > Source register zero: 00000 > The immediate 56: 000000111000 > `000000111000 | 00000 | 000 | 01010 | 0010011` > All together: 00000011100000000000010100010011 > (In hex: `0x3800513`) 2. What is the dissassmbly of `0x00008067` (shown in Line 9) ? __ A02 __ > * A02 = ? > ==`jalr zero, ra, 0`== OR ==`jalr x0, x1, 0`== 3. Explain why the above program works. __ A03 __ > * A03 = ? > 至少要解釋 Reinterpret the array address as a function --- ## Problem `B` The Python code for the recursive function `count8`, which counts the number of 8's (base 16) in the input, is shown below. `count8(0x988) = 2` and `count8(0x81) = 1`, for instance. A RISC-V assembly-based implementation of the function is shown below. - [ ] Python code ```python def count8(in): if in < 8: return 0 end = in & 0xF current_8 = 0 if end == 8: current_8 = 1 return current_8 + count8(in >> 4) ``` - [ ] The corresponding RISC-V assembly ```c count8: li a1, 8 bgt a1, a0, base addi sp, sp, -12 sw ra, 0(sp) sw s1, 4(sp) andi a2, a0, 0xF li s1, 0 L1: bne __ B01 __ addi s1, s1, 1 continue: srli a0, a0, 4 sw a0, 8(sp) call count8 add a0, a0, s1 L2: lw ra, 0(sp) lw s1, 4(sp) addi sp, sp, 12 j done base: li a0, 0 done: ret ``` 1. What should be `B01` to make the assembly implementation match the Python code? > * B01 = ? > ==`a1, a2, continue`== 2. How many words will be written to the stack before the program makes each recursive call to the function `count8`? __ B02 __ > * B02 = ? > ==3== The instruction call `count8` makes the first call to the function `count8` in the program outside of the function definition. The program is interrupted right before the execution of `lw ra, 0(sp)` at label `L2` during a recursive call to count8. The layout of a memory area is shown in the diagram below. The values of all addresses and data are shown in hex. The sp register's current value, `0xEAC`, points to the location indicated in the diagram. | - | Address | Data | |---|---------|------| | | 0xE94 | 0x9E4 | | | 0xE98 | 0x0 | | | 0xE9C | 0x0 | | | 0xEA0 | __ B03 __ | | | 0xEA4 | __ B04 __ | | | 0xEA8 | __ B05 __ | | sp→ | 0xEAC | 0x9E4 | | | 0xEB0 | 0x1 | | | 0xEB4 | 0x82 | | | 0xEB8 | 0xC8 | | | 0xEBC | 0x73 | | | 0xEC0 | 0x828 | | | 0xEC4 | 0xC14 | | | 0xEC8 | 0x82 | 3. Fill B03, B04, and B05 for the stack associated with `0xEA0`, `0xEA4`, and `0xEA8` (answer in HEX) > * B03 = ? > ==0x9E4== > * B04 = ? > ==0x1== > * B05 = ? > ==0x8== > Each stack frame stores ra, s1, a0. The current stack frame corresponds to a recursive call to count8. Therefore, 0x9E4 is the recursive ra value. > This means that 0xC8 is the original ra value. S1 is a 1 if the value shifted out in the previous recursion was an 8. A0 is the previous value of a0 shifted to the right by 4. 0x82 from address 0xEB4 shifted to the right by 4 gives us 0x8 for the value in 0xEA8. The s1 value corresponds to the value that was shifted out in the previous recursive call. Since that value was an 8, 0xEA4 is a 1. > We see that 0xE94 through 0xE9C are from count8(0x8), the 3 below are from count8(0x82), then 0xEAC to 0xEB4 are from count8(0x828), and 3 below are from count8(0x8288). 4. What is the initial input in at the initial call to count8? __ B06 __ (answer in HEX) > * B06 = ? > ==0x8288== > In each iteration s1 is set to 1 if the least significant 4 bits equal 8 and 0 otherwise. The stack frame beginning at 0xEB8 corresponds to the initial call to count8. The value stored at 0xEC0 is the initial input shifted to the right by 4 (0x828). The s1 in the next stack frame (0xEB0) is the value of s1 corresponding to the 4 bits that were just shifted out. > Since this value is 1, it means that the least significant 4 bits equal 8. Therefore, the initial input was 0x8288. 5. What are the values in the following registers right when the execution of `count8` is interrupted? (answer in HEX) * ra = B07 * a2 = B08 * s1 = B09 > * B07 = ? > ==0x9E4== > * B08 = ? > ==0x8== > * B09 = ? > ==0x1== 5. What is the value in register `a0` right when the execution of `count8` is interrupted? (answer in HEX) __ B10 __ > * B10 = ? > ==0x2== 7. What is the hex address of the `call f` instruction that made the initial call to count8? __ B11 __ > * B11 = ? > ==0xC4== 8. What is the hex address of the `continue` label? __ B12 __ > * B12 = ? > ==0x9D8== --- ## Problem `C` Take a look at the logic diagram below, which computes two outputs {x, y} from four inputs {a, b, c, d}. Calculate the circuit's t~PD~ using the t~PD~ details for the gate components listed in the table below. ![](https://hackmd.io/_uploads/rJDvcMxLj.png) | Gate | t~PD~ | |:-----:|:-----:| | XNOR2 | 5.5ns | | AND2 | 3.5ns | | OR2 | 2.5ns | | INV | 1.0ns | t~PD~ = __ C01 __ ns > * C01 = ? > ==9== > We can examine the different paths that exist from each input to the outputs and calculate the critical delay for each. > ![](https://hackmd.io/_uploads/HyWfizlIi.png) > ![](https://hackmd.io/_uploads/SkDXifl8o.png) > The propagation time for the circuit is the slowest path from input to output, so tPD = 9.0 ns. > We can also avoid examining some of the paths by noticing that some of the paths differ only in that one path goes through an additional gate compared to another, and thus we should not even consider the shorter of those two paths. --- ## Problem `D` Assume that you are creating `G` circuit, as seen below. Everything has been running smoothly until you discover that the G circuit's throughput is too low one day. Your biggest client needs a throughput at least twice as high as what you can currently offer in order to include the `G` circuit into their new product. You know precisely what to do: pipeline it as the `G` circuit is now only combinational. ![](https://hackmd.io/_uploads/HJNLTzgUi.png) 1. You choose to start small with a two-stage pipeline for your initial iteration. Calculate your circuit's overall throughput and latency. ![](https://hackmd.io/_uploads/rJQN17lIj.png) * Latency: __ D01 __ ns * Throughput: __ D02 __ ns^-1^ > * D01 = ? > ==24== > * D02 = ? > ==$\dfrac{1}{12}$== > ![](https://hackmd.io/_uploads/S1f3kQgLs.png) 2. You decide to add an additional level after completing the first iteration because you are beginning to feel quite at ease with the `G` circuit. Let's implement the maximal throughput pipeline. Kindly explain a pipeline using the following design that maximizes throughput while utilizing the fewest possible registers. Calculate your circuit's throughput and latency. NOTE: Use the fewest possible pipeline stages to maximize throughput in your pipelined circuit. ![](https://hackmd.io/_uploads/rym_BQgUi.png) * Latency: __ D03 __ ns * Throughput: __ D04 __ ns^-1^ > * D03 = ? > ==28== > * D04 = ? > ==$\dfrac{1}{7}$== > ![](https://hackmd.io/_uploads/SkX3BmxLs.png) 3. Assume that we want to integrate another pipelined circuit, Circuit H, shown below. However, this circuit has multiple different timing specifications! In the below table, you have timing specifications for two different implementations of H: Version A, and Version B. Using your maximal-throughput G circuit from part C, select the version of H that minimizes your latency of the combined G-H circuit. Additionally, calculate the latency and throughput of the combined G-H circuit. ![](https://hackmd.io/_uploads/rkw5gHgIi.png) | - | Version A | Version B | |---|-----------|-----------| | clock period | 7 ns | 8 ns | | stages | 2 | 1 | Which one can minimize your latency of the combined G-H circuit? (Answer in Version A or Version B) __ D05 __ Overall latency = __ D06 __ ns Overall throughput = __ D07 __ ns^-1^ > * D05 = ? > ==Version B== > * D06 = ? > ==40== > 8 ns * 5 stages = 40 ns > * D07 = ? > ==$\dfrac{1}{8}$== --- ## Problem `E` We are considering the following RISC-V processor designs. After running these 12 cycles, figure out what x3 and x4 are valued for processor 1. For each of the three processors, the code and the starting state of the necessary registers in the register file are shown below. Keep in mind that while x6 is in hexadecimal, registers x1 through x5 are given their values in decimal. ```c start: lw x1, 0(x6) addi x2, x0, 5 blt x1, x2, end addi x3, x2, 11 sub x4, x3, x1 xori x5, x6, 0x1 end: sub x4, x3, x2 addi x3, x4, 7 addi x3, x1, 3 . = 0x400 .word 0x1 ``` | Register | Value | |----------|-------| | x1 | 5 | | x2 | 11 | | x3 | 30 | | x4 | 19 | | x5 | 20 | | x6 | 0x400 | Processor 1: 5-stage pipelined RISC-V processor, which is fully bypassed and has branch annulment. Branch decisions are made in the EXE stage and branches are always predicted not taken. What are the values of registers x3 and x4 at the start of cycle 12 (in decimal): * x3: __ E01 __ * x4: __ E02 __ > * E01 = ? > ==32== > * E02 = ? > ==25== > ![](https://hackmd.io/_uploads/rkIZFml8i.png) --- ## Problem `F` The P~1~ and the P~2~ are two RISC-V based designs whose performance you are expected to evaluate. You have made the choice to compare the performance of the two CPUs using the next construction step. ```c L1: slli x11, x10, 2 lw x12, 0x80(x11) add x13, x13, x12 addi x10, x10, 1 blt x10, x14, L1 sub x14, x14, x15 xori x15, x14, 0x1 ``` The P~1~ is a 4-stage pipelined processor with full bypassing. It attempts to improve performance by decreasing the number of cycles it takes to execute each instruction. The `EXE` and `MEM` stages are combined. Load requests are sent in the `EXE` stage and received in the `WB` stage one cycle after. The `IF` stage speculatively fetches the instruction at PC+4 and annuls incorrectly fetched instructions after a branch misprediction. Branches are resolved in the `EXE` stage. The P~2~ is a 6-stage pipelined processor with full bypassing. It attempts to improve performance by raising the clock speed. The `EXE` stage is split into two stages to break up the critical path through the ALU. The `IF` stage always fetches the instruction at PC+4 and annuls incorrectly fetched instructions after a taken branch. ALU and Branch ALU results are available only in `EXE2`. 1. How many cycles does it take the P~1~ to execute one iteration of the loop? __ F01 __ > * F01 = ? > ==8== > ![](https://hackmd.io/_uploads/rJE8INxUj.png) > ![](https://hackmd.io/_uploads/rkNOINx8i.png) 2. How many cycles does it take the P~2~ to execute one iteration of the loop? __ F02 __ > * F02 = ? > ==13== > ![](https://hackmd.io/_uploads/Sy08IEgLs.png) > ![](https://hackmd.io/_uploads/HkaKU4gIi.png) 3. Assume that we can build the P~1~ with a clock period of 6ns. What is the clock period that the P~2~ must achieve to attain the same performance on this assembly sequence? __ F03 __ ns > * F03 = ? > ==$\dfrac{48}{13}$== > P~1~: 8 cycles (6 ns/cycle) = 48 ns per loop iteration > P~2~: 13 cycles (t ns/cycle) = 48 ns per loop iteration > t = 48/13 --- ## Problem `G` Assume that while our data memory utilizes a little more realistic model that includes a DRAM, our instruction memory continues to respond in one cycle. A memory address, a read/write signal, and optional write data are all inputs to the memory. The CPU first uses the cache to fulfill read or write requests. The data can be quickly sent back to the pipeline if the access succeeds in the cache. It will take significantly longer for the CPU to reach the DRAM, and then retrieve the needed data from it. Assume that each of the 2 processors has extra hardware that enables them to foresee when branches will be taken and to get the branch's destination instruction immediately following the branch instruction (without requiring any stalls). Consider the following 2 RISC-V processor designs. - [ ] P~1~: Basic 5-stage pipeline with bypass paths from EXE to DEC and from WB to DEC ![](https://hackmd.io/_uploads/SJujKVeIo.png) - [ ] P~2~: Since the memory access can take a long time in the worst case, we use pipelining to improve the processor performance. The `MEM` stage is further divided into three pipeline stages. ![](https://hackmd.io/_uploads/Bk9x9EeIs.png) 1. Assume the 2 processors have ideal registers (t~SETUP~ = t~HOLD~ = t~PD~ = t~CD~ = 0ns) between different pipeline stages. Please derive the minimum clock cycle given the following propagation delays for the logic blocks. | Logic Block | Propagation Delay | |:-----------:|:-----------------:| | IF | 2 ns | | DEC | 2 ns | | EXE | 3 ns | | MEM | 10 ns | | WB | 2 ns | | bypass | 1.5 ns | | MEM1 | 4 ns | | MEM2 | 3 ns | | MEM3 | 3 ns | * Clock period for P~1~: __ G01 __ ns * Clock period for P~2~: __ G02 __ ns > * G01 = ? > ==10== > * G02 = ? > ==4.5== 2. The code segment below copies elements from array A to array B, where Array A starts at address `0x100`, and Array B starts at address 0x500. Assume that this loop has been running for a long time, and that the processor always predicts that the branch will be taken. ```c add x1, x0, x0 add x2, x0, x0 Next: lw x3, 0x100(x1) addi x1, x1, 4 sw x3, 0x500(x2) addi x2, x2, 4 blt x1, x4, Next // x4=1000 ``` How many cycles does this loop take to execute in P~1~? __ G03 __ > * G03 = ? > ==6== > ![](https://hackmd.io/_uploads/B1na24lIi.png) 3. Assume the memory can only serve one read or write operation at a time. In P~2~, since `MEM1`, `MEM2` and `MEM3` stages all use the memory, we now have structural hazard. Structural hazard happens if multiple instructions try to use the same hardware resource. To resolve the structural hazard, we can only allow one memory access instruction exist in the `MEM1`, `MEM2`, `and` MEM3 stages. Note that it is ok to have multiple other types of instructions, e.g., arithmetic instructions, as long as they do not need to use the memory. Specifically, we need some mechanism to stall memory access instructions. The detailed stall mechanism is explained below. We will inject NOPs to the EXE stage when the IF and DEC stages are stalled. When the DEC has a memory access instruction, we stall for the following 2 cases: * If the EXE stage also has a memory access instruction, stall IF and DEC for 2 cycles * If the MEM1 stage also has a memory access instruction, stall IF and DEC for 1 cycle These stalls are in addition to any stalls required to deal with data hazards. How many cycles does this loop take to execute in P~2~? __ G04 __ > * G04 = ? > ==8== > ![](https://hackmd.io/_uploads/S1iypVeUs.png)