Solutions
A
Implement a RISC-V function called sum_of_squares
that, given an integer n
, calculates the following sum:
If is not a positive integer (n ≤ 0
), the function should return the value of .
This task requires a recursive algorithm that uses the a1
register to keep track of the accumulated sum. You will implement the following function:
sum_of_squares
a0
: A 32-bit integer representing n
. You can assume that n ≤ 10000
.a1
: A 32-bit integer representing m
.a0
the result of:n ≤ 0
, return the value of m
.You have access to a helper function square
that computes the square of an integer.
square
a0
: An integer n
.a0
will hold the value of .To implement the recursive logic, consider updating the value of m
at each step. Specifically, let:
Then, the original sum:
can be rewritten as:
This transformation allows you to reduce the problem size by decrementing and updating with the new accumulated sum (). Use this approach to implement the recursive function and follow the RISC-V calling convention.
You are required to complete the RISC-V assembly code above, ensuring that no pseudoinstructions are used.
A01 = blt x0, a0, recurse_case
A02 = jalr x0, ra, 0
A03 = addi sp, sp, -12
A04 = sw ra, 8(sp)
A05 = jal ra, square
A06 = lw ra, 8(sp)
A07 = addi sp, sp, 12
A08 = addi sp, sp, -4
A09 = jal ra, sum_of_squares
A10 = addi sp, sp, 4
A11 = jalr x0, ra, 0
From the perspective of the callee, is it necessary to save any registers in the prologue and epilogue of the function? If so, which registers need to be saved, and where should the prologue and epilogue be placed? If not, provide a brief explanation as to why. __ A12 __
A12 = (一定要有 No 或 Not,意思相近就給分)
No, we do not need to consider callee-saved registers because we are not using any in this function. However, since the function makes two function calls, it may be necessary to save the
ra
register in the prologue and restore it in the epilogue, right before thejalr x0, ra, 0
instruction, just before thezero_case
label.
Next, we will implement the square
function, which calculates the square of an integer (a0 = n
) and returns the result in a0
. This function adheres to the RISC-V calling convention and avoids using RV32M instructions, ensuring compatibility with the RV32I base ISA.
A13 = sw ra, 0(sp)
A14 = and t3, t2, x1
A15 = slli t1, t1, 1
A16 = srli t2, t2, 1
A17 = bne t2, x0, square_loop
A18 = lw ra, 0(sp)
A19 = jalr x0, ra, 0
B
Similar to logic gates, registers also experience a delay before their output reflects the sampled input. This delay is called the clock-to-Q (clk-to-q) delay, where "Q" typically denotes the output. The clk-to-q delay is the time between the rising edge of the clock signal and when the register's output updates to match the input.
In the circuit below:
Determine:
B01 = 14
B02 = 25
B03 = 40M
The maximum allowable hold time for RegC is determined by the time it takes for the input of RegC to change. This is calculated as the clk-to-q delay of RegA or RegB plus the shortest combinational logic delay:
The minimum clock cycle time must account for the clk-to-q delay, the longest combinational logic delay, and the setup time of RegC:
A clock cycle time of 25 ns corresponds to a clock frequency of:
C
Consider a left-complete binary search tree represented as an array. For example:
The corresponding binary tree structure is:
To find the position of elements in sorted order, we perform an in-order traversal (left-root-right) on this tree. The sorted order is:
Let's go through two specific examples:
nodes
), 3 is at index 8 (0-based).nodes
), 7 is at index 10 (0-based).This example demonstrates how elements in a left-complete, array-based binary search tree can be efficiently accessed and positioned based on their sorted order.
Sorted Order | Element | 0-Based Index | 1-Based Position |
---|---|---|---|
3rd | 3 | 8 | 9 |
7th | 7 | 10 | 11 |
Now let's generalize the problem. Suppose we want to find the position of the -th element in sorted order for a subtree rooted at position with a given height .
Define the function:
For example:
The typical method involves traversing the tree using a loop, utilizing the fact that the left and right child positions can be calculated as and , respectively. This approach has a time complexity of , where is the number of bits in the integer representation (commonly 32 bits).
We can achieve a more efficient solution with a time complexity of using a direct bitwise calculation:
The proposed approach avoids iterative traversal, greatly speeding up the computation, and you should complete the above. Explain whether the division (/
) operation could become a performance bottleneck in this solution. __ C02 __
C01 = ~(k - 1)
(或等效的形式)
: Isolates the least significant bit set in , which represents the smallest power of 2 divisor of .
C02 = (意思相近就給分,務必提到 power of 2)
Since the division operation is performed by a power of 2, the compiler optimizes it into a bit shift. This results in an efficient solution for finding the -th element in a left-complete, array-based binary tree.
D
We have several addressing modes for accessing memory (excluding the immediate mode):
lw
, lb
, sw
, and sb
).jalr
, jr
, and ret
, where jr
and ret
are pseudoinstructions translated into jalr
).What is the maximum range of 32-bit instructions that can be reached from the current PC using a jump instruction?
The range covers addresses within instructions from the current PC.
D01 =
D02 =
The immediate field for the
jal
instruction is 20 bits, while for thejalr
instruction, it is only 12 bits. This meansjal
can cover a wider range of addresses. Similar to the previous addressing mode, the 20-bit immediate value is multiplied by 2 and added to the program counter (PC) to determine the target address. Since the immediate value is signed, it spans a range of bytes, or 2-byte instructions. However, because we are dealing with 4-byte instructions, the effective range is instructions relative to the current PC.
Assume we are implementing a minimal RISC-V RV32I disassembler, which operates as follows:
You need to complete the program by filling in the specified placeholders, such as D03 and D04.
D03 = "jalr"
D04 = "lbu"
D05 = "lhu"
D06 = "slli"
D07 = "slti"
D08 = "sltiu"
D09 = "andi"
D10 = "ecall"
D11 = "csrrs"
D12 = "bltu"
D13 = "bgeu"
D14 = "lui"
D15 = "auipc"
D16 = "jal"
D17 = e.u.i31_12
D18 = e.j.i10_1 << 1
D19 = e.s.i11_5 << 5
D20 = e.b.i4_1 << 1
E
The critical path is the longest delay path between state elements in a circuit. The circuit’s clock speed cannot exceed this limit, as a faster clock would not allow sufficient time for the correct value to propagate to the state elements. By placing additional registers along the critical path, we can reduce the clock period by decreasing the amount of logic between registers.
The delay times for each circuit element are as follows:
For a single-cycle CPU, how long does it take to execute each instruction? Disregard the clock cycle duration based on the critical path, and assume that the setup times for both the register file (RegFile) and the program counter (PC) are identical.
sw
: __ E01 __ nslw
: __ E02 __ nsjal
: __ E03 __ nsE01 = 665
sw = clk-to-Q + Mem-Read + max(RegFileRead + Mux(ASel), Imm-Gen +
Mux(BSel)) + ALU + DMEM Setup
= 5ns + 300 ns + 60ns + 100ns + 200ns
= 665ns
E02 = 800
lw = clk-to-Q + Mem-Read + max(RegFileRead + Mux(ASel), Imm-Gen +
Mux(BSel)) + ALU + Mem-Read + Mux(WBSel) + RegFileSetup
= 5ns + 300ns + 60ns + 100ns + 300ns + 15ns + 20ns
= 800ns
E03 = 500
jal= clk-to-Q + Mem-Read + Imm-Gen + Mux(BSel) + ALU + Mux(PCSel) + PCSetup
= 5ns + 300ns + 45ns + 15ns + 100ns + 15ns + 20ns
= 500ns
The top diagram shows the standard single-cycle datapath, while the second diagram illustrates a modified version. Referring to the modified single-cycle datapath, consider the information that must be carried forward from one stage to the next. Which pipeline registers are necessary at the end of each stage?
E04 = PC + 4 注意: E04-E06 的順序可換
E05 = ALUOut
E06 = MEM
E07 = Inst
E08 = PC
E09 = ALUOut 注意: E09 和 E10 的順序可換
E10 = RegReadData2
E11 = Inst
F
For all questions, assume there is no branch prediction or double-pumping (i.e., no write-then-read in a single cycle for the RegFile). Most data hazards can be resolved through forwarding, where the result from the EX or MEM stage is sent directly to the EX stage for use by a subsequent instruction.
Note: How is forwarding (EX to EX or MEM to EX) implemented in hardware? We add two wires: one from the start of the MEM stage carrying the ALU output, and another from the start of the WB stage. Both wires connect to the A/B multiplexers in the EX stage.
Identify data hazards in the code below and determine how forwarding could be used to resolve them.
There are two data hazards: one between instructions __ F01 __ and __ F02 __ , and another between instructions __ F03 __ and 3. The first hazard can be resolved by forwarding the ALU output from the __ F04 __ stage in cycle C3 to the start of the __ F05 __ stage in cycle C4. The second hazard can be resolved by forwarding the ALU output from the __ F06 __ stage in cycle C4 to the start of the __ F07 __ stage in cycle C5.
F01 = 1
F02 = 2
F03 = 1
F04 = MEM
F05 = EX
F06 = WB
F07 = EX
Identify the data hazards in the code below. One of them cannot be resolved using forwarding—why not? How can we address this issue?
There are two data hazards present in the code:
t0
.t1
.The hazard between instructions __ F12 __ and __ F13 __ can be resolved using forwarding. However, the hazard between instructions __ F14 __ and __ F15 __ cannot be resolved with forwarding. This is because instruction __ F16 __ requires the result of instruction __ F17 __ at the beginning of cycle C6, but the result is not available until the end of cycle C6.
To resolve this, we can introduce a stall by inserting a nop
(no-operation) instruction between instructions __ F18 __ and __ F19 __.
F08 = 2
F09 = 3
F10 = 3
F11 = 4
F12 = 2
F13 = 3
F14 = 3
F15 = 4
F16 = 4
F17 = 3
F18 = 3
F19 = 4
Imagine you are the compiler and can reorder instructions to minimize data hazards while still producing the same output. How would you modify the code? Reorder the instructions as __ F20 __ , since instruction 1 has no dependencies and can be moved without affecting the result.
F20 = 2-3-1-4
For the RISC-V code provided below and a pipelined CPU without forwarding, how many data hazards would occur? __ F21 __
F21 = 4
There are four hazards: a data hazard from
t1
between instructions 1 and 2, a data hazard froms0
between instructions 2 and 3, another data hazard froms0
between instructions 2 and 4, and a control hazard between instructions 4 and 5.
How many stalls would be required to resolve the data hazard(s) if the RegFile supports double-pumping (i.e., write-then-read capability)? __ F22 __
F22 = 2
Assuming the RegFile allows simultaneous read and write operations in the same cycle, two stalls are needed between instructions 1 and 2 (WB → ID), and another two stalls are needed between instructions 2 and 3 (WB → ID).
G
In Chisel, hardware components are referred to as modules. Each module extends the Module
class and includes a field named io
for defining its interface. The interface is specified using a Bundle
, which is wrapped with a call to IO()
. The Bundle
contains fields representing the input and output ports of the module. The direction of these ports is specified using Input()
or Output()
, based on the perspective of the component itself.
We will demonstrate this with a simple design where a counter is constructed using two components: an adder and a register. Next, we create a counter using these two components that counts from 0 to 9 and then resets. The schematic of the counter is depicted below.
An adder is used to increment the current counter value by 1. A multiplexer selects between the incremented value and 0. The result of this selection, referred to as next
, is used as the input to the register component. The output of the register represents the current count value and also serves as the output of the Count10
module (dout
).
The following shows the Chisel code for the Count10
module. The two components are instantiated using new
, wrapped with Module()
, and assigned the names add
and reg
. The register's output (reg.io.q
) is given the name count
.
We connect 1.U
and count
to the inputs of the adder component. The adder's output is named result
. The multiplexer selects between 0.U
and result
based on the current counter value (count
). The output of the multiplexer, named next
, is connected to the input of the register component. Finally, we connect the counter value (count
) to the output of the Count10
module (io.dout
).
Complete the above to make the design work.
G01 = reg.io.q
G02 = Mux(count === 9.U, 0.U, result)
G03 = reg.io.d := next
A key component in circuits that perform computation, such as a microprocessor, is the Arithmetic Logic Unit (ALU). The figure below depicts the symbol of an ALU.
The ALU has two data inputs, labeled a
and b
, one function input labeled fn
, and an output labeled y
. The ALU processes inputs a
and b
and produces the result at the output y
. The fn
input determines which operation is applied to a
and b
. Typical operations include arithmetic functions like addition and subtraction, as well as logical functions like AND, OR, and XOR, hence the name ALU
.
The ALU is typically a combinational circuit with no state elements. It may also include additional outputs to indicate specific properties of the result, such as whether the result is zero or its sign.
The following code defines an ALU with 16-bit inputs and outputs, supporting addition, subtraction, OR, and AND operations, selected by a 2-bit fn
signal.
G04 = val y = Output(UInt (16.W))
G05 = io.y := io.a & io.b
H
Examine the following RISC-V assembly code for the floating-point addition function (fadd
), designed for single-precision floating-point addition using custom bitwise operations. Your task is to complete the code so that it functions as intended. This version includes simplified rounding logic, which may need enhancements for full IEEE 754 compliance.
H01 = srli t2, a0, 31
H02 = srli t2, a1, 31
H03 = bgeu a2, t0, overflow_xe_gte_ye
H04 = srli a4, a4, 9
H05 = slli t2, t2, 31
H06 = 0x7F800000
H07 = slli t2, t2, 31
Consider the following implementation of converting a signed 32-bit integer to a single-precision floating-point number (IEEE 754 format) using the RISC-V RV32I assembly. The focus is on understanding the low-level operations required to perform this conversion without relying on pseudo-instructions or extended instruction sets.
H08 = sll a1, a1, t1
H09 = slli a1, a1, 10
H10 = slli a0, a0, 23