--- tags: computer-arch --- # Quiz4 of Computer Architecture (2022 Fall) :::info :information_source: General Information * You are allowed to read **[lecture materials](http://wiki.csie.ncku.edu.tw/arch/schedule)**. * That is, an open book exam. * We are using the honor system during this quiz, and would like you to accept the following: 1. You will not share the quiz with anyone. 2. You will not talk to anyone about the information on the quiz until the answers are made public. * Each answer has 4 points. * You must answer in valid numeric/C99/assembly representation and/or English alphabet except your formal name. * Message **[the instructor via Facebook](https://www.facebook.com/JservFans)** if you found something wrong. * :timer_clock: 09:10 ~ 10:15AM on Nov 15, 2022 ::: ## Problem `A` Assume that we are building a [JIT compiler](https://en.wikipedia.org/wiki/Just-in-time_compilation), which generates RISC-V instructions at runtime. Because a JIT must render and execute a native binary image at runtime, true machine-code JITs necessitate platforms that allow for data to be executed at runtime -- code is data. Here is a simplified implementation based on C and RV32I: (filename: `jit.c`) ```c= #include <stdint.h> #include <stdio.h> typedef int func_t(void); int main(int argc, char *argv[]) { uint32_t instructions[] = { [0] = A01, /* addi a0, zero, ? */ [1] = 0x00008067, /* A02 */ }; /* Reinterpret the array address as a function */ func_t *jit = (func_t *) instructions; printf("%d\n", jit()); return 0; } ``` Compile via the command below: ```shell $ riscv-none-elf-gcc -O0 -march=rv32i -mabi=ilp32 -o jit.elf jit.c ``` Run it with [rv32emu](https://github.com/sysprog21/rv32emu). ```shell $ rv32emu jit.elf ``` We get the corresponding output. ``` 56 inferior exit code 0 ``` Reference: [Porting a JIT compiler to RISC-V: Challenges and Opportunities](https://hal.archives-ouvertes.fr/hal-03725841/document) 1. Given instruction `addi a0, zero, ?` as shown in Line 8, what is the HEX value of A01? > * A01 = ? 2. What is the dissassmbly of `0x00008067` (shown in Line 9) ? __ A02 __ > * A02 = ? 3. Explain why the above program works. __ A03 __ > * A03 = ? --- ## Problem `B` The Python code for the recursive function `count8`, which counts the number of 8's (base 16) in the input, is shown below. `count8(0x988) = 2` and `count8(0x81) = 1`, for instance. A RISC-V assembly-based implementation of the function is shown below. - [ ] Python code ```python def count8(in): if in < 8: return 0 end = in & 0xF current_8 = 0 if end == 8: current_8 = 1 return current_8 + count8(in >> 4) ``` - [ ] The corresponding RISC-V assembly ```c count8: li a1, 8 bgt a1, a0, base addi sp, sp, -12 sw ra, 0(sp) sw s1, 4(sp) andi a2, a0, 0xF li s1, 0 L1: bne __ B01 __ addi s1, s1, 1 continue: srli a0, a0, 4 sw a0, 8(sp) call count8 add a0, a0, s1 L2: lw ra, 0(sp) lw s1, 4(sp) addi sp, sp, 12 j done base: li a0, 0 done: ret ``` 1. What should be `B01` to make the assembly implementation match the Python code? > * B01 = ? 2. How many words will be written to the stack before the program makes each recursive call to the function `count8`? __ B02 __ > * B02 = ? The instruction call `count8` makes the first call to the function `count8` in the program outside of the function definition. The program is interrupted right before the execution of `lw ra, 0(sp)` at label `L2` during a recursive call to `count8`. The layout of a memory area is shown in the diagram below. The values of all addresses and data are shown in hex. The sp register's current value, `0xEAC`, points to the location indicated in the diagram. | - | Address | Data | |---|---------|------| | | 0xE94 | 0x9E4 | | | 0xE98 | 0x0 | | | 0xE9C | 0x0 | | | 0xEA0 | __ B03 __ | | | 0xEA4 | __ B04 __ | | | 0xEA8 | __ B05 __ | | sp→ | 0xEAC | 0x9E4 | | | 0xEB0 | 0x1 | | | 0xEB4 | 0x82 | | | 0xEB8 | 0xC8 | | | 0xEBC | 0x73 | | | 0xEC0 | 0x828 | | | 0xEC4 | 0xC14 | | | 0xEC8 | 0x82 | 3. Fill B03, B04, and B05 for the stack associated with `0xEA0`, `0xEA4`, and `0xEA8` (answer in HEX) > * B03 = ? > * B04 = ? > * B05 = ? 4. What is the initial input in at the initial call to count8? __ B06 __ (answer in HEX) > * B06 = ? 5. What are the values in the following registers right when the execution of `count8` is interrupted? (answer in HEX) * ra = B07 * a2 = B08 * s1 = B09 > * B07 = ? > * B08 = ? > * B09 = ? 5. What is the value in register `a0` right when the execution of `count8` is interrupted? (answer in HEX) __ B10 __ > * B10 = ? 7. What is the hex address of the `call f` instruction that made the initial call to `count8`? __ B11 __ > * B11 = ? 8. What is the hex address of the `continue` label? __ B12 __ > * B12 = ? --- ## Problem `C` Take a look at the logic diagram below, which computes two outputs {x, y} from four inputs {a, b, c, d}. Calculate the circuit's t~PD~ using the t~PD~ details for the gate components listed in the table below. ![](https://hackmd.io/_uploads/rJDvcMxLj.png) | Gate | t~PD~ | |:-----:|:-----:| | XNOR2 | 5.5ns | | AND2 | 3.5ns | | OR2 | 2.5ns | | INV | 1.0ns | t~PD~ = __ C01 __ ns > * C01 = ? --- ## Problem `D` Assume that you are creating `G` circuit, as seen below. Everything has been running smoothly until you discover that the G circuit's throughput is too low one day. Your biggest client needs a throughput at least twice as high as what you can currently offer in order to include the `G` circuit into their new product. You know precisely what to do: pipeline it as the `G` circuit is now only combinational. ![](https://hackmd.io/_uploads/HJNLTzgUi.png) 1. You choose to start small with a two-stage pipeline for your initial iteration. Calculate your circuit's overall throughput and latency. ![](https://hackmd.io/_uploads/rJQN17lIj.png) * Latency: __ D01 __ ns * Throughput: __ D02 __ ns^-1^ > * D01 = ? > * D02 = ? 2. You decide to add an additional level after completing the first iteration because you are beginning to feel quite at ease with the `G` circuit. Let's implement the maximal throughput pipeline. Kindly explain a pipeline using the following design that maximizes throughput while utilizing the fewest possible registers. Calculate your circuit's throughput and latency. NOTE: Use the fewest possible pipeline stages to maximize throughput in your pipelined circuit. ![](https://hackmd.io/_uploads/rym_BQgUi.png) * Latency: __ D03 __ ns * Throughput: __ D04 __ ns^-1^ > * D03 = ? > * D04 = ? 3. Assume that we want to integrate another pipelined circuit, Circuit H, shown below. However, this circuit has multiple different timing specifications! In the below table, you have timing specifications for two different implementations of H: Version A, and Version B. Using your maximal-throughput G circuit from part C, select the version of H that minimizes your latency of the combined G-H circuit. Additionally, calculate the latency and throughput of the combined G-H circuit. ![](https://hackmd.io/_uploads/rkw5gHgIi.png) | - | Version A | Version B | |---|-----------|-----------| | clock period | 7 ns | 8 ns | | stages | 2 | 1 | * Which one can minimize your latency of the combined G-H circuit? (Answer in `Version A` or `Version B`) __ D05 __ * Overall latency = __ D06 __ ns * Overall throughput = __ D07 __ ns^-1^ > * D05 = ? > * D06 = ? > * D07 = ? --- ## Problem `E` We are considering the following RISC-V processor design. For this processor `P`, the code and the starting state of the necessary registers in the register file are shown below. Keep in mind that while `x6` is in hexadecimal, registers `x1` through `x5` are given their values in decimal. ```c start: lw x1, 0(x6) addi x2, x0, 5 blt x1, x2, end addi x3, x2, 11 sub x4, x3, x1 xori x5, x6, 0x1 end: sub x4, x3, x2 addi x3, x4, 7 addi x3, x1, 3 . = 0x400 .word 0x1 ``` | Register | Value | |----------|-------| | x1 | 5 | | x2 | 11 | | x3 | 30 | | x4 | 19 | | x5 | 20 | | x6 | 0x400 | Processor `P`: 5-stage pipelined RISC-V processor, which is fully bypassed and has branch annulment. Branch decisions are made in the `EXE` stage and branches are always predicted not taken. What are the values of registers `x3` and `x4` at the start of cycle 12 (in decimal): * x3: __ E01 __ * x4: __ E02 __ > * E01 = ? > * E02 = ? --- ## Problem `F` The P~1~ and the P~2~ are two RISC-V based designs whose performance you are expected to evaluate. You have made the choice to compare the performance of the two CPUs using the next construction step. ```c L1: slli x11, x10, 2 lw x12, 0x80(x11) add x13, x13, x12 addi x10, x10, 1 blt x10, x14, L1 sub x14, x14, x15 xori x15, x14, 0x1 ``` The P~1~ is a 4-stage pipelined processor with full bypassing. It attempts to improve performance by decreasing the number of cycles it takes to execute each instruction. The `EXE` and `MEM` stages are combined. Load requests are sent in the `EXE` stage and received in the `WB` stage one cycle after. The `IF` stage speculatively fetches the instruction at PC+4 and annuls incorrectly fetched instructions after a branch misprediction. Branches are resolved in the `EXE` stage. The P~2~ is a 6-stage pipelined processor with full bypassing. It attempts to improve performance by raising the clock speed. The `EXE` stage is split into two stages to break up the critical path through the ALU. The `IF` stage always fetches the instruction at PC+4 and annuls incorrectly fetched instructions after a taken branch. ALU and Branch ALU results are available only in `EXE2`. 1. How many cycles does it take the P~1~ to execute one iteration of the loop? __ F01 __ > * F01 = ? 2. How many cycles does it take the P~2~ to execute one iteration of the loop? __ F02 __ > * F02 = ? 3. Assume that we can build the P~1~ with a clock period of 6ns. What is the clock period that the P~2~ must achieve to attain the same performance on this assembly sequence? __ F03 __ ns > * F03 = ? --- ## Problem `G` Assume that while our data memory utilizes a little more realistic model that includes a DRAM, our instruction memory continues to respond in one cycle. A memory address, a read/write signal, and optional write data are all inputs to the memory. The CPU first uses the cache to fulfill read or write requests. The data can be quickly sent back to the pipeline if the access succeeds in the cache. It will take significantly longer for the CPU to reach the DRAM, and then retrieve the needed data from it. Assume that each of the 2 processors has extra hardware that enables them to foresee when branches will be taken and to get the branch's destination instruction immediately following the branch instruction (without requiring any stalls). Consider the following 2 RISC-V processor designs. - [ ] P~1~: Basic 5-stage pipeline with bypass paths from EXE to DEC and from WB to DEC ![](https://hackmd.io/_uploads/SJujKVeIo.png) - [ ] P~2~: Since the memory access can take a long time in the worst case, we use pipelining to improve the processor performance. The `MEM` stage is further divided into three pipeline stages. ![](https://hackmd.io/_uploads/Bk9x9EeIs.png) 1. Assume the 2 processors have ideal registers (t~SETUP~ = t~HOLD~ = t~PD~ = t~CD~ = 0ns) between different pipeline stages. Please derive the minimum clock cycle given the following propagation delays for the logic blocks. | Logic Block | Propagation Delay | |:-----------:|:-----------------:| | IF | 2 ns | | DEC | 2 ns | | EXE | 3 ns | | MEM | 10 ns | | WB | 2 ns | | bypass | 1.5 ns | | MEM1 | 4 ns | | MEM2 | 3 ns | | MEM3 | 3 ns | * Clock period for P~1~: __ G01 __ ns * Clock period for P~2~: __ G02 __ ns > * G01 = ? > * G02 = ? 2. The code segment below copies elements from array A to array B, where Array A starts at address `0x100`, and Array B starts at address 0x500. Assume that this loop has been running for a long time, and that the processor always predicts that the branch will be taken. ```c add x1, x0, x0 add x2, x0, x0 Next: lw x3, 0x100(x1) addi x1, x1, 4 sw x3, 0x500(x2) addi x2, x2, 4 blt x1, x4, Next // x4=1000 ``` How many cycles does this loop take to execute in P~1~? __ G03 __ > * G03 = ? 3. Assume the memory can only serve one read or write operation at a time. In P~2~, since `MEM1`, `MEM2` and `MEM3` stages all use the memory, we now have structural hazard. Structural hazard happens if multiple instructions try to use the same hardware resource. To resolve the structural hazard, we can only allow one memory access instruction exist in the `MEM1`, `MEM2`, `and` MEM3 stages. Note that it is ok to have multiple other types of instructions, e.g., arithmetic instructions, as long as they do not need to use the memory. Specifically, we need some mechanism to stall memory access instructions. The detailed stall mechanism is explained below. We will inject NOPs to the `EXE` stage when the `IF` and `DEC` stages are stalled. When the `DEC` has a memory access instruction, we stall for the following 2 cases: * If the `EXE` stage also has a memory access instruction, stall `IF` and `DEC` for 2 cycles * If the `MEM1` stage also has a memory access instruction, stall `IF` and `DEC` for 1 cycle These stalls are in addition to any stalls required to deal with data hazards. How many cycles does this loop take to execute in P~2~? __ G04 __ > * G04 = ?