Solutions
A
Assume that we are building a JIT compiler, which generates RISC-V instructions at runtime. Because a JIT must render and execute a native binary image at runtime, true machine-code JITs necessitate platforms that allow for data to be executed at runtime – code is data. Here is a simplified implementation based on C and RV32I: (filename: jit.c
)
Compile via the command below:
Run it with rv32emu.
We get the corresponding output.
Reference: Porting a JIT compiler to RISC-V: Challenges and Opportunities
addi a0, zero, ?
as shown in Line 8, what is the HEX value of A01?
- A01 = ?
0x3800513
The instruction would beaddi a0, zero, 56
Opcode addi: 0010011
destination register a0: 01010
funct3: 000
Source register zero: 00000
The immediate 56: 000000111000
000000111000 | 00000 | 000 | 01010 | 0010011
All together: 00000011100000000000010100010011
(In hex:0x3800513
)
0x00008067
(shown in Line 9) ? __ A02 __
- A02 = ?
jalr zero, ra, 0
ORjalr x0, x1, 0
- A03 = ?
至少要解釋 Reinterpret the array address as a function
B
The Python code for the recursive function count8
, which counts the number of 8's (base 16) in the input, is shown below. count8(0x988) = 2
and count8(0x81) = 1
, for instance. A RISC-V assembly-based implementation of the function is shown below.
B01
to make the assembly implementation match the Python code?
- B01 = ?
a1, a2, continue
count8
? __ B02 __
- B02 = ?
3
The instruction call count8
makes the first call to the function count8
in the program outside of the function definition. The program is interrupted right before the execution of lw ra, 0(sp)
at label L2
during a recursive call to count8. The layout of a memory area is shown in the diagram below. The values of all addresses and data are shown in hex. The sp register's current value, 0xEAC
, points to the location indicated in the diagram.
- | Address | Data |
---|---|---|
0xE94 | 0x9E4 | |
0xE98 | 0x0 | |
0xE9C | 0x0 | |
0xEA0 | __ B03 __ | |
0xEA4 | __ B04 __ | |
0xEA8 | __ B05 __ | |
sp→ | 0xEAC | 0x9E4 |
0xEB0 | 0x1 | |
0xEB4 | 0x82 | |
0xEB8 | 0xC8 | |
0xEBC | 0x73 | |
0xEC0 | 0x828 | |
0xEC4 | 0xC14 | |
0xEC8 | 0x82 |
0xEA0
, 0xEA4
, and 0xEA8
(answer in HEX)
- B03 = ?
0x9E4- B04 = ?
0x1- B05 = ?
0x8
Each stack frame stores ra, s1, a0. The current stack frame corresponds to a recursive call to count8. Therefore, 0x9E4 is the recursive ra value.
This means that 0xC8 is the original ra value. S1 is a 1 if the value shifted out in the previous recursion was an 8. A0 is the previous value of a0 shifted to the right by 4. 0x82 from address 0xEB4 shifted to the right by 4 gives us 0x8 for the value in 0xEA8. The s1 value corresponds to the value that was shifted out in the previous recursive call. Since that value was an 8, 0xEA4 is a 1.
We see that 0xE94 through 0xE9C are from count8(0x8), the 3 below are from count8(0x82), then 0xEAC to 0xEB4 are from count8(0x828), and 3 below are from count8(0x8288).
- B06 = ?
0x8288
In each iteration s1 is set to 1 if the least significant 4 bits equal 8 and 0 otherwise. The stack frame beginning at 0xEB8 corresponds to the initial call to count8. The value stored at 0xEC0 is the initial input shifted to the right by 4 (0x828). The s1 in the next stack frame (0xEB0) is the value of s1 corresponding to the 4 bits that were just shifted out.
Since this value is 1, it means that the least significant 4 bits equal 8. Therefore, the initial input was 0x8288.
count8
is interrupted? (answer in HEX)
- B07 = ?
0x9E4- B08 = ?
0x8- B09 = ?
0x1
a0
right when the execution of count8
is interrupted? (answer in HEX) __ B10 __
- B10 = ?
0x2
call f
instruction that made the initial call to count8? __ B11 __
- B11 = ?
0xC4
continue
label? __ B12 __
- B12 = ?
0x9D8
C
Take a look at the logic diagram below, which computes two outputs {x, y} from four inputs {a, b, c, d}. Calculate the circuit's tPD using the tPD details for the gate components listed in the table below.
Gate | tPD |
---|---|
XNOR2 | 5.5ns |
AND2 | 3.5ns |
OR2 | 2.5ns |
INV | 1.0ns |
tPD = __ C01 __ ns
- C01 = ?
9
We can examine the different paths that exist from each input to the outputs and calculate the critical delay for each.
The propagation time for the circuit is the slowest path from input to output, so tPD = 9.0 ns.
We can also avoid examining some of the paths by noticing that some of the paths differ only in that one path goes through an additional gate compared to another, and thus we should not even consider the shorter of those two paths.
D
Assume that you are creating G
circuit, as seen below. Everything has been running smoothly until you discover that the G circuit's throughput is too low one day. Your biggest client needs a throughput at least twice as high as what you can currently offer in order to include the G
circuit into their new product. You know precisely what to do: pipeline it as the G
circuit is now only combinational.
You choose to start small with a two-stage pipeline for your initial iteration. Calculate your circuit's overall throughput and latency.
- D01 = ?
24- D02 = ?
You decide to add an additional level after completing the first iteration because you are beginning to feel quite at ease with the G
circuit. Let's implement the maximal throughput pipeline. Kindly explain a pipeline using the following design that maximizes throughput while utilizing the fewest possible registers. Calculate your circuit's throughput and latency. NOTE: Use the fewest possible pipeline stages to maximize throughput in your pipelined circuit.
- D03 = ?
28- D04 = ?
Assume that we want to integrate another pipelined circuit, Circuit H, shown below. However, this circuit has multiple different timing specifications! In the below table, you have timing specifications for two different implementations of H: Version A, and Version B. Using your maximal-throughput G circuit from part C, select the version of H that minimizes your latency of the combined G-H circuit. Additionally, calculate the latency and throughput of the combined G-H circuit.
- | Version A | Version B |
---|---|---|
clock period | 7 ns | 8 ns |
stages | 2 | 1 |
Which one can minimize your latency of the combined G-H circuit? (Answer in Version A or Version B) __ D05 __
Overall latency = __ D06 __ ns
Overall throughput = __ D07 __ ns-1
- D05 = ?
Version B- D06 = ?
40
8 ns * 5 stages = 40 ns- D07 = ?
E
We are considering the following RISC-V processor designs. After running these 12 cycles, figure out what x3 and x4 are valued for processor 1. For each of the three processors, the code and the starting state of the necessary registers in the register file are shown below. Keep in mind that while x6 is in hexadecimal, registers x1 through x5 are given their values in decimal.
Register | Value |
---|---|
x1 | 5 |
x2 | 11 |
x3 | 30 |
x4 | 19 |
x5 | 20 |
x6 | 0x400 |
Processor 1: 5-stage pipelined RISC-V processor, which is fully bypassed and has branch annulment. Branch decisions are made in the EXE stage and branches are always predicted not taken. What are the values of registers x3 and x4 at the start of cycle 12 (in decimal):
- E01 = ?
32- E02 = ?
25
F
The P1 and the P2 are two RISC-V based designs whose performance you are expected to evaluate. You have made the choice to compare the performance of the two CPUs using the next construction step.
The P1 is a 4-stage pipelined processor with full bypassing. It attempts to improve performance by decreasing the number of cycles it takes to execute each instruction. The EXE
and MEM
stages are combined. Load requests are sent in the EXE
stage and received in the WB
stage one cycle after. The IF
stage speculatively fetches the instruction at PC+4 and annuls incorrectly fetched instructions after a branch misprediction. Branches are resolved in the EXE
stage.
The P2 is a 6-stage pipelined processor with full bypassing. It attempts to improve performance by raising the clock speed. The EXE
stage is split into two stages to break up the critical path through the ALU. The IF
stage always fetches the instruction at PC+4 and annuls incorrectly fetched instructions after a taken branch. ALU and Branch ALU results are available only in EXE2
.
- F01 = ?
8
- F02 = ?
13
- F03 = ?
P1: 8 cycles (6 ns/cycle) = 48 ns per loop iteration
P2: 13 cycles (t ns/cycle) = 48 ns per loop iteration
t = 48/13
G
Assume that while our data memory utilizes a little more realistic model that includes a DRAM, our instruction memory continues to respond in one cycle. A memory address, a read/write signal, and optional write data are all inputs to the memory. The CPU first uses the cache to fulfill read or write requests. The data can be quickly sent back to the pipeline if the access succeeds in the cache. It will take significantly longer for the CPU to reach the DRAM, and then retrieve the needed data from it.
Assume that each of the 2 processors has extra hardware that enables them to foresee when branches will be taken and to get the branch's destination instruction immediately following the branch instruction (without requiring any stalls).
Consider the following 2 RISC-V processor designs.
P1: Basic 5-stage pipeline with bypass paths from EXE to DEC and from WB to DEC
P2: Since the memory access can take a long time in the worst case, we use pipelining to improve the processor performance. The MEM
stage is further divided into three pipeline stages.
Assume the 2 processors have ideal registers (tSETUP = tHOLD = tPD = tCD = 0ns) between different pipeline stages. Please derive the minimum clock cycle given the following propagation delays for the logic blocks.
Logic Block | Propagation Delay |
---|---|
IF | 2 ns |
DEC | 2 ns |
EXE | 3 ns |
MEM | 10 ns |
WB | 2 ns |
bypass | 1.5 ns |
MEM1 | 4 ns |
MEM2 | 3 ns |
MEM3 | 3 ns |
- G01 = ?
10- G02 = ?
4.5
The code segment below copies elements from array A to array B, where Array A starts at address 0x100
, and Array B starts at address 0x500. Assume that this loop has been running for a long time, and that the processor always predicts that the branch will be taken.
How many cycles does this loop take to execute in P1? __ G03 __
- G03 = ?
6
Assume the memory can only serve one read or write operation at a time. In P2, since MEM1
, MEM2
and MEM3
stages all use the memory, we now have structural hazard. Structural hazard happens if multiple instructions try to use the same hardware resource. To resolve the structural hazard, we can only allow one memory access instruction exist in the MEM1
, MEM2
, and
MEM3 stages. Note that it is ok to have multiple other types of instructions, e.g., arithmetic instructions, as long as they do not need to use the memory. Specifically, we need some mechanism to stall memory access instructions.
The detailed stall mechanism is explained below. We will inject NOPs to the EXE stage when the IF and DEC stages are stalled. When the DEC has a memory access instruction, we stall for the following 2 cases:
These stalls are in addition to any stalls required to deal with data hazards. How many cycles does this loop take to execute in P2? __ G04 __
- G04 = ?
8