---
tags: computer-arch
---
# Quiz4 of Computer Architecture (2024 Fall)
> Solutions
## Problem `A`
Implement a RISC-V function called `sum_of_squares` that, given an integer `n`, calculates the following sum:
$$
n^2 + (n - 1)^2 + (n - 2)^2 + \ldots + 1^2
$$
If $n$ is not a positive integer (`n ≤ 0`), the function should return the value of $m$.
This task requires a recursive algorithm that uses the `a1` register to keep track of the accumulated sum. You will implement the following function:
- [ ] Function: `sum_of_squares`
- [ ] Input:
* `a0`: A 32-bit integer representing `n`. You can assume that `n ≤ 10000`.
* `a1`: A 32-bit integer representing `m`.
- [ ] Output:
* The function should return in `a0` the result of:
$$
m + n^2 + (n - 1)^2 + (n - 2)^2 + \ldots + 1^2
$$
* If `n ≤ 0`, return the value of `m`.
You have access to a helper function `square` that computes the square of an integer.
- [ ] Function: `square`
- [ ] Input:
* `a0`: An integer `n`.
- [ ] Output:
* `a0` will hold the value of $n^2$.
To implement the recursive logic, consider updating the value of `m` at each step. Specifically, let:
$$
m' = m + n^2
$$
Then, the original sum:
$$
m + n^2 + (n - 1)^2 + \ldots + 1^2
$$
can be rewritten as:
$$
m' + (n - 1)^2 + \ldots + 1^2
$$
This transformation allows you to reduce the problem size by decrementing $n$ and updating $m$ with the new accumulated sum ($m'$). Use this approach to implement the recursive function and follow the RISC-V calling convention.
```c
sum_of_squares:
# Check if n (a0) is less than or equal to zero
bgt a0, x0, recurse_case # __ A01 __
zero_case:
# If n ≤ 0, return m (a1)
add a0, a1, x0
jalr x0, ra, 0 # __ A02 __
recurse_case:
# Save caller-saved registers on the stack
add t0, a0, x0 # t0 = a0 (copy n)
addi sp, sp, -12 # Allocate stack space __ A03 __
sw a1, 0(sp) # Save a1 (m)
sw t0, 4(sp) # Save t0 (n)
sw ra, 8(sp) # Save return address __ A04 __
# Call the square function
jal ra, square # __ A05 __
# Restore registers and stack
lw a1, 0(sp) # Restore a1 (m)
lw t0, 4(sp) # Restore t0 (n)
lw ra, 8(sp) # Restore return address __ A06 __
addi sp, sp, 12 # Deallocate stack space __ A07 __
# Update m = m + n^2
add a1, a1, a0
# Decrement n: a0 = n - 1
addi a0, t0, -1
# Recursive call to sum_of_squares
addi sp, sp, -4 # Allocate stack space for ra __ A08 __
sw ra, 0(sp) # Save return address
jal ra, sum_of_squares # __ A09 __
lw ra, 0(sp) # Restore return address
addi sp, sp, 4 # Deallocate stack space __ A10 __
# Return from the function
jalr x0, ra, 0 # __ A11 __
```
You are required to complete the RISC-V assembly code above, ensuring that no pseudoinstructions are used.
A01 = `blt x0, a0, recurse_case`
A02 = `jalr x0, ra, 0`
A03 = `addi sp, sp, -12`
A04 = `sw ra, 8(sp)`
A05 = `jal ra, square`
A06 = `lw ra, 8(sp)`
A07 = `addi sp, sp, 12`
A08 = `addi sp, sp, -4`
A09 = `jal ra, sum_of_squares`
A10 = `addi sp, sp, 4`
A11 = `jalr x0, ra, 0`
From the perspective of the callee, is it necessary to save any registers in the prologue and epilogue of the function? If so, which registers need to be saved, and where should the prologue and epilogue be placed? If not, provide a brief explanation as to why. __ A12 __
A12 = (一定要有 No 或 Not,意思相近就給分)
> No, we do not need to consider callee-saved registers because we are not using any in this function. However, since the function makes two function calls, it may be necessary to save the `ra` register in the prologue and restore it in the epilogue, right before the `jalr x0, ra, 0` instruction, just before the `zero_case` label.
Next, we will implement the `square` function, which calculates the square of an integer (`a0 = n`) and returns the result in `a0`. This function adheres to the RISC-V calling convention and avoids using RV32M instructions, ensuring compatibility with the RV32I base ISA.
```c
# Function: square
# Computes the square of an integer (a0 = n), returns result in a0
square:
addi sp, sp, -8 # Allocate stack space
sw ra, 0(sp) # Save return address __ A13 __
add t0, x0, x0 # t0 = 0 (accumulator for the result)
add t1, a0, x0 # t1 = a0 (copy of n, multiplicand)
add t2, a0, x0 # t2 = a0 (copy of n, multiplier)
square_loop:
and t3, t2, x1 # Check the lowest bit of t2 (t2 & 1) __ A14 __
beq t3, x0, skip_add # If the bit is 0, skip addition
add t0, t0, t1 # Accumulate: t0 += t1
skip_add:
sll t1, t1, 1 # Left shift t1 (multiply by 2) __ A15 __
srl t2, t2, 1 # Right shift t2 (divide by 2) __ A16 __
bne t2, x0, square_loop # Repeat loop if t2 is not zero __ A17 __
square_end:
add a0, t0, x0 # Move result to a0
lw ra, 0(sp) # Restore return address __ A18 __
addi sp, sp, 8 # Deallocate stack space
jalr x0, ra, 0 # Return from function __ A19 __
```
A13 = `sw ra, 0(sp)`
A14 = `and t3, t2, x1`
A15 = `slli t1, t1, 1`
A16 = `srli t2, t2, 1`
A17 = `bne t2, x0, square_loop`
A18 = `lw ra, 0(sp)`
A19 = `jalr x0, ra, 0`
---
## Problem `B`
Similar to logic gates, registers also experience a delay before their output reflects the sampled input. This delay is called the clock-to-Q (clk-to-q) delay, where "Q" typically denotes the output. The clk-to-q delay is the time between the rising edge of the clock signal and when the register's output updates to match the input.
In the circuit below:
- Registers RegA and RegB have setup, hold, and clk-to-q times of 4 ns each.
- Each logic gate has a delay of 5 ns.
- Register RegC has a setup time of 6 ns.
![circuit-3](https://hackmd.io/_uploads/ryqXUgYzJe.png)
Determine:
1. What is the maximum allowable hold time for RegC? __ B01 __ ns
2. What is the minimum acceptable clock cycle time for this circuit? __ B02 __ ns
3. What clock frequency does this correspond to? __ B03 __ HZ
B01 = 14
B02 = 25
B03 = 40M
> The maximum allowable hold time for RegC is determined by the time it takes for the input of RegC to change. This is calculated as the clk-to-q delay of RegA or RegB plus the shortest combinational logic delay:
$$
4 \, \text{ns} + (5 \, \text{ns} + 5 \, \text{ns}) = 14 \, \text{ns}
$$
>
> The minimum clock cycle time must account for the clk-to-q delay, the longest combinational logic delay, and the setup time of RegC:
$$
4 \, \text{ns} + (5 \, \text{ns} + 5 \, \text{ns} + 5 \, \text{ns}) + 6 \, \text{ns} = 25 \, \text{ns}
$$
>
> A clock cycle time of 25 ns corresponds to a clock frequency of:
$$
\frac{1}{25 \times 10^{-9} \, \text{s}} = 40 \, \text{MHz}
$$
---
## Problem `C`
Consider a left-complete binary search tree represented as an array. For example:
```c
nodes = [8, 4, 12, 2, 6, 10, 14, 1, 3, 5, 7, 9, 11, 13, 15]
```
The corresponding binary tree structure is:
```
8 (index 0)
4 (index 1) 12 (index 2)
2 (index 3) 6 (index 4) 10 (index 5) 14 (index 6)
1 (index 7) 3 (index 8) 5 (index 9) 7 (index 10) 9 (index 11) 11 (index 12) 13 (index 13) 15 (index 14)
```
To find the position of elements in sorted order, we perform an **in-order traversal** (left-root-right) on this tree. The sorted order is:
```
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]
```
Let's go through two specific examples:
1. 3rd Element in Sorted Order:
- The 3rd element in the sorted order is **3**.
- In the array representation (`nodes`), **3** is at **index 8** (0-based).
- In 1-based indexing, this corresponds to **position 9**.
2. 7th Element in Sorted Order:
- The 7th element in the sorted order is **7**.
- In the array representation (`nodes`), **7** is at **index 10** (0-based).
- In 1-based indexing, this corresponds to **position 11**.
This example demonstrates how elements in a left-complete, array-based binary search tree can be efficiently accessed and positioned based on their sorted order.
| Sorted Order | Element | 0-Based Index | 1-Based Position |
|--------------|---------|---------------|------------------|
| 3rd | 3 | 8 | 9 |
| 7th | 7 | 10 | 11 |
Now let's generalize the problem. Suppose we want to find the position of the $k$-th element in sorted order for a subtree rooted at position $p$ with a given height $h$.
Define the function:
```c
k_child(p, h, k)
```
- $p$: Position of the root of the subtree (1-based indexing).
- $h$: Height of the subtree (a tree with a single node has height 0).
- $k$: The rank of the desired element in sorted order.
For example:
- For the subtree rooted at position $p = 3$ (element **12**) with height $h = 2$, the 3rd element in sorted order is at position **13** (element **11**).
- For $p = 3$, $h = 1$, and $k = 3$, the 3rd element is at position **7** (element **7**).
The typical method involves traversing the tree using a loop, utilizing the fact that the left and right child positions can be calculated as $2 \times p$ and $2 \times p + 1$, respectively. This approach has a time complexity of $O(b^2)$, where $b$ is the number of bits in the integer representation (commonly 32 bits).
We can achieve a more efficient solution with a time complexity of $O(b)$ using a direct bitwise calculation:
```c
#define k_child(p, h, k) (((p << h) | (k >> 1)) / (k & (~(k - 1))))
```
1. $p \ll h$: Shifts the position $p$ left by $h$ bits, effectively scaling the root position to the height of the subtree.
2. $k \gg 1$: Shifts $k$ right by 1 bit, halving it and adjusting for 0-based indexing.
The proposed approach avoids iterative traversal, greatly speeding up the computation, and you should complete the above. Explain whether the division (`/`) operation could become a performance bottleneck in this $O(b)$ solution. __ C02 __
C01 = `~(k - 1)` (或等效的形式)
> $k \& (~(k - 1))$: Isolates the least significant bit set in $k$, which represents the smallest power of 2 divisor of $k$.
C02 = (意思相近就給分,務必提到 power of 2)
> Since the division operation is performed by a power of 2, the compiler optimizes it into a bit shift. This results in an efficient $O(b)$ solution for finding the $k$-th element in a left-complete, array-based binary tree.
---
## Problem `D`
We have several addressing modes for accessing memory (excluding the immediate mode):
1. Base Displacement Addressing: Adds an immediate value to a register's value to form a memory address (used by instructions like `lw`, `lb`, `sw`, and `sb`).
2. PC-Relative Addressing: Uses the current value of the program counter (PC) and adds the immediate value (multiplied by 2) from the instruction to form a target address (used by branch and jump instructions).
3. Register Addressing: Uses the value in a register as the target address (e.g., `jalr`, `jr`, and `ret`, where `jr` and `ret` are pseudoinstructions translated into `jalr`).
What is the maximum range of 32-bit instructions that can be reached from the current PC using a jump instruction?
The range covers addresses within $[-2^{18}, 2^{18} - 1]$ instructions from the current PC.
D01 = $-2^{18}$
D02 = $2^{18} - 1$
> The immediate field for the `jal` instruction is 20 bits, while for the `jalr` instruction, it is only 12 bits. This means `jal` can cover a wider range of addresses. Similar to the previous addressing mode, the 20-bit immediate value is multiplied by 2 and added to the program counter (PC) to determine the target address. Since the immediate value is signed, it spans a range of $[-2^{20}, 2^{20} - 2]$ bytes, or $[-2^{19}, 2^{19} - 1]$ 2-byte instructions. However, because we are dealing with 4-byte instructions, the effective range is $[-2^{18}, 2^{18} - 1]$ instructions relative to the current PC.
Assume we are implementing a minimal RISC-V RV32I disassembler, which operates as follows:
```
Disassembling raw binary: jalr.bin
0: 00000097 auipc x1, 0x0
4: 00808093 addi x1, x1, 8
8: 06428293 addi x5, x5, 100
c: 00008067 jalr x0, x1, 0
```
You need to complete the program by filling in the specified placeholders, such as D03 and D04.
```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
typedef uint8_t u8;
typedef uint16_t u16;
typedef uint32_t u32;
typedef uint64_t u64;
typedef int32_t s32;
/* stream of 32 bit fixed length instructions */
struct progbits {
u32 size;
u32 *data;
};
static struct progbits loadbits_raw(char *filename)
{
FILE *f = fopen(filename, "rb");
unsigned int size;
fseek(f, 0, SEEK_END);
size = ftell(f);
fseek(f, 0, SEEK_SET);
u32 *text = malloc(size);
fread(text, size, 1, f);
fclose(f);
return (struct progbits) {size / sizeof(u32), text};
}
enum format { R, I, S, B, U, J };
struct {
u8 opcode; /* 7-bit */
enum format fmt;
} opcodefmt[] = {
{0x37, U}, {0x17, U}, {0x6f, J}, {0x67, I}, {0x63, B}, {0x03, I},
{0x23, S}, {0x13, I}, {0x33, R}, {0x0f, I}, {0x73, I},
};
union encoding {
u32 insn;
struct { /* generic */
u32 opcode : 7;
u32 rd : 5;
u32 funct3 : 3;
u32 rs1 : 5;
u32 rs2 : 5;
u32 funct7 : 7;
};
struct {
u32 opcode : 7;
u32 rd : 5;
u32 funct3 : 3;
u32 rs1 : 5;
u32 rs2 : 5;
u32 funct7 : 7;
} r;
struct {
u32 opcode : 7;
u32 rd : 5;
u32 funct3 : 3;
u32 rs1 : 5;
s32 i11_0 : 12; /* sign extension */
} i;
struct {
u32 opcode : 7;
u32 i4_0 : 5;
u32 funct3 : 3;
u32 rs1 : 5;
u32 rs2 : 5;
s32 i11_5 : 7; /* sign extension */
} s;
struct {
u32 opcode : 7;
u32 i11 : 1;
u32 i4_1 : 4;
u32 funct3 : 3;
u32 rs1 : 5;
u32 rs2 : 5;
u32 i10_5 : 6;
s32 i12 : 1; /* sign extension */
} b;
struct {
u32 opcode : 7;
u32 rd : 5;
u32 i31_12 : 20;
} u;
struct {
u32 opcode : 7;
u32 rd : 5;
u32 i19_12 : 8;
u32 i11 : 1;
u32 i10_1 : 10;
s32 i20 : 1; /* sign extension */
} j;
};
static int format(u8 opcode)
{
for (int i = 0, n = sizeof opcodefmt / sizeof opcodefmt[0]; i < n; ++i)
if (opcode == opcodefmt[i].opcode) return opcodefmt[i].fmt;
return -1;
}
static char *iname(u32 insn)
{
union encoding e = {insn};
switch (format(e.opcode)) {
case R:
switch (e.funct3) {
case 0: return e.funct7 ? "sub" : "add";
case 1: return "sll";
case 2: return "slt";
case 3: return "sltu";
case 4: return "xor";
case 5: return e.funct7 ? "sra" : "srl";
case 6: return "or";
case 7: return "and";
}
break;
case I:
switch (e.opcode) {
case 0x67: return "jalr"; /* D03 */
case 0x03:
switch (e.funct3) {
case 0: return "lb";
case 1: return "lh";
case 2: return "lw";
case 4: return "lbu"; /* D04 */
case 5: return "lhu"; /* D05 */
}
break;
case 0x13:
switch (e.funct3) {
case 0: return "addi";
case 1: return "slli"; /* D06 */
case 2: return "slti"; /* D07 */
case 3: return "sltiu"; /* D08 */
case 4: return "xori";
case 5: return e.funct7 ? "srai" : "srli";
case 6: return "ori";
case 7: return "andi"; /* D09 */
}
break;
case 0x0f:
switch (e.funct3) {
case 0: return "fence";
case 1: return "fence.i";
}
break;
case 0x73:
switch (e.funct3) {
case 0: return e.rs2 ? "ebreak" : "ecall"; /* D10 */
case 1: return "csrrw";
case 2: return "csrrs"; /* D11 */
case 3: return "csrrc";
case 5: return "csrrwi";
case 6: return "csrrsi";
case 7: return "csrrci";
}
break;
}
break;
case S:
switch (e.funct3) {
case 0: return "sb";
case 1: return "sh";
case 2: return "sw";
}
break;
case B:
switch (e.funct3) {
case 0: return "beq";
case 1: return "bne";
case 4: return "blt";
case 5: return "bge";
case 6: return "bltu"; /* D12 */
case 7: return "bgeu"; /* D13 */
}
break;
case U:
switch (e.opcode) {
case 0x37: return "lui"; /* D14 */
case 0x17: return "auipc"; /* D15 */
}
break;
case J: return "jal"; /* D16 */
}
return NULL;
}
static char *op0(u32 insn)
{
union encoding e = {insn};
char *name = calloc(16 + 1, sizeof *name);
switch (format(e.opcode)) {
case R:
case I: sprintf(name, "x%d", e.rd); break;
case S: sprintf(name, "x%d", e.rs2); break;
case B: sprintf(name, "x%d", e.rs1); break;
case U:
case J: sprintf(name, "x%d", e.rd); break;
}
return name;
}
static char *op1(u32 insn)
{
union encoding e = {insn};
char *name = calloc(16 + 1, sizeof *name);
switch (format(e.opcode)) {
case R:
case I:
case S: sprintf(name, "x%d", e.rs1); break;
case B: sprintf(name, "x%d", e.rs2); break;
case U: sprintf(name, "0x%x", e.u.i31_12); break; /* D17 */
case J:
sprintf(name, "%d",
(e.j.i20 << 20) | (e.j.i19_12 << 12) | (e.j.i11 << 11) |
(e.j.i10_1 << 1)); /* D18 */
break;
}
return name;
}
static char *op2(u32 insn)
{
union encoding e = {insn};
char *name = calloc(16 + 1, sizeof *name);
switch (format(e.opcode)) {
case R: sprintf(name, "x%d", e.rs2); break;
case I: sprintf(name, "%d", e.i.i11_0); break;
case S: sprintf(name, "%d", (e.s.i11_5 << 5) | e.s.i4_0); break; /* D19 */
case B:
sprintf(name, "%d",
(e.b.i12 << 12) | (e.b.i11 << 11) | (e.b.i10_5 << 5) |
(e.b.i4_1 << 1)); /* D20 */
break;
case U: break;
case J: break;
}
return name;
}
int main(int argc, char **argv)
{
if (argc < 2) return 1;
struct progbits pb;
printf("disassembling raw binary %s\n", argv[1]);
pb = loadbits_raw(argv[1]);
for (u32 i = 0, n = pb.size, insn = pb.data[i]; i < n;
++i, insn = pb.data[i])
printf("%4x: %08x %-8s %3s %3s %3s\n", 4 * i, insn,
iname(insn), op0(insn), op1(insn), op2(insn));
return 0;
}
```
D03 = `"jalr"`
D04 = `"lbu"`
D05 = `"lhu"`
D06 = `"slli"`
D07 = `"slti"`
D08 = `"sltiu"`
D09 = `"andi"`
D10 = `"ecall"`
D11 = `"csrrs"`
D12 = `"bltu"`
D13 = `"bgeu"`
D14 = `"lui"`
D15 = `"auipc"`
D16 = `"jal"`
D17 = `e.u.i31_12`
D18 = `e.j.i10_1 << 1`
D19 = `e.s.i11_5 << 5`
D20 = `e.b.i4_1 << 1`
---
## Problem `E`
The critical path is the longest delay path between state elements in a circuit. The circuit’s clock speed cannot exceed this limit, as a faster clock would not allow sufficient time for the correct value to propagate to the state elements. By placing additional registers along the critical path, we can reduce the clock period by decreasing the amount of logic between registers.
The delay times for each circuit element are as follows:
![latency](https://hackmd.io/_uploads/r1-02Mtf1e.png)
![CPU](https://hackmd.io/_uploads/rkmq0zKfkx.png)
For a single-cycle CPU, how long does it take to execute each instruction? Disregard the clock cycle duration based on the critical path, and assume that the setup times for both the register file (RegFile) and the program counter (PC) are identical.
1. `sw` : __ E01 __ ns
2. `lw` : __ E02 __ ns
3. `jal` : __ E03 __ ns
E01 = 665
> sw = clk-to-Q + Mem-Read + max(RegFileRead + Mux(ASel), Imm-Gen +
Mux(BSel)) + ALU + DMEM Setup
= 5ns + 300 ns + 60ns + 100ns + 200ns
= 665ns
E02 = 800
> lw = clk-to-Q + Mem-Read + max(RegFileRead + Mux(ASel), Imm-Gen +
Mux(BSel)) + ALU + Mem-Read + Mux(WBSel) + RegFileSetup
= 5ns + 300ns + 60ns + 100ns + 300ns + 15ns + 20ns
= 800ns
E03 = 500
> jal= clk-to-Q + Mem-Read + Imm-Gen + Mux(BSel) + ALU + Mux(PCSel) + PCSetup
= 5ns + 300ns + 45ns + 15ns + 100ns + 15ns + 20ns
= 500ns
![modified](https://hackmd.io/_uploads/BkWGlQtzkl.png)
The top diagram shows the standard single-cycle datapath, while the second diagram illustrates a modified version. Referring to the modified single-cycle datapath, consider the information that must be carried forward from one stage to the next. Which pipeline registers are necessary at the end of each stage?
- [ ] MEM to WB:
* __ E04 __: input into WBSel mux.
* __ E05 __: input into WBSel mux.
* __ E06 __: input into WBSel mux.
* __ E07 __: input into next stage’s control logic.
E04 = PC + 4 注意: E04-E06 的順序可換
E05 = ALUOut
E06 = MEM
E07 = Inst
- [ ] EX to MEM:
* __ E08 __: input into the +4 block in the MEM stage.
* __ E09 __: is an input into DMEM.
* __ E10 __: is an input into DMEM,
* __ E11 __: input into next stage’s control logic.
E08 = PC
E09 = ALUOut 注意: E09 和 E10 的順序可換
E10 = RegReadData2
E11 = Inst
---
## Problem `F`
For all questions, assume there is no branch prediction or double-pumping (i.e., no write-then-read in a single cycle for the RegFile). Most data hazards can be resolved through forwarding, where the result from the EX or MEM stage is sent directly to the EX stage for use by a subsequent instruction.
Note: How is forwarding (EX to EX or MEM to EX) implemented in hardware? We add two wires: one from the start of the MEM stage carrying the ALU output, and another from the start of the WB stage. Both wires connect to the A/B multiplexers in the EX stage.
Identify data hazards in the code below and determine how forwarding could be used to resolve them.
![hazard](https://hackmd.io/_uploads/HyXzGQYz1g.png)
There are two data hazards: one between instructions __ F01 __ and __ F02 __ , and another between instructions __ F03 __ and 3. The first hazard can be resolved by forwarding the ALU output from the __ F04 __ stage in cycle C3 to the start of the __ F05 __ stage in cycle C4. The second hazard can be resolved by forwarding the ALU output from the __ F06 __ stage in cycle C4 to the start of the __ F07 __ stage in cycle C5.
F01 = 1
F02 = 2
F03 = 1
F04 = MEM
F05 = EX
F06 = WB
F07 = EX
Identify the data hazards in the code below. One of them cannot be resolved using forwarding—why not? How can we address this issue?
![hazard](https://hackmd.io/_uploads/Bky5m7KGJl.png)
There are two data hazards present in the code:
1. The first hazard is between instructions __ F08 __ and __ F09 __ , involving the register `t0`.
2. The second hazard is between instructions __ F10 __ and __ F11 __ , involving the register `t1`.
The hazard between instructions __ F12 __ and __ F13 __ can be resolved using forwarding. However, the hazard between instructions __ F14 __ and __ F15 __ cannot be resolved with forwarding. This is because instruction __ F16 __ requires the result of instruction __ F17 __ at the beginning of cycle C6, but the result is not available until the end of cycle C6.
To resolve this, we can introduce a stall by inserting a `nop` (no-operation) instruction between instructions __ F18 __ and __ F19 __.
F08 = 2
F09 = 3
F10 = 3
F11 = 4
F12 = 2
F13 = 3
F14 = 3
F15 = 4
F16 = 4
F17 = 3
F18 = 3
F19 = 4
Imagine you are the compiler and can reorder instructions to minimize data hazards while still producing the same output. How would you modify the code? Reorder the instructions as __ F20 __ , since instruction 1 has no dependencies and can be moved without affecting the result.
F20 = 2-3-1-4
For the RISC-V code provided below and a pipelined CPU without forwarding, how many data hazards would occur? __ F21 __
![pipeline](https://hackmd.io/_uploads/BJF2SXKf1e.png)
F21 = 4
> There are four hazards: a data hazard from `t1` between instructions 1 and 2, a data hazard from `s0` between instructions 2 and 3, another data hazard from `s0` between instructions 2 and 4, and a control hazard between instructions 4 and 5.
How many stalls would be required to resolve the data hazard(s) if the RegFile supports double-pumping (i.e., write-then-read capability)? __ F22 __
F22 = 2
> Assuming the RegFile allows simultaneous read and write operations in the same cycle, two stalls are needed between instructions 1 and 2 (WB → ID), and another two stalls are needed between instructions 2 and 3 (WB → ID).
---
## Problem `G`
In Chisel, hardware components are referred to as **modules**. Each module extends the `Module` class and includes a field named `io` for defining its interface. The interface is specified using a `Bundle`, which is wrapped with a call to `IO()`. The `Bundle` contains fields representing the input and output ports of the module. The direction of these ports is specified using `Input()` or `Output()`, based on the perspective of the component itself.
We will demonstrate this with a simple design where a counter is constructed using two components: an adder and a register. Next, we create a counter using these two components that counts from 0 to 9 and then resets. The schematic of the counter is depicted below.
![chisel](https://hackmd.io/_uploads/rJaRwQKGkl.png)
An adder is used to increment the current counter value by 1. A multiplexer selects between the incremented value and 0. The result of this selection, referred to as `next`, is used as the input to the register component. The output of the register represents the current count value and also serves as the output of the `Count10` module (`dout`).
The following shows the Chisel code for the `Count10` module. The two components are instantiated using `new`, wrapped with `Module()`, and assigned the names `add` and `reg`. The register's output (`reg.io.q`) is given the name `count`.
```scala
class Adder extends Module {
val io = IO(new Bundle {
val a = Input(UInt (8.W))
val b = Input(UInt (8.W))
val y = Output(UInt (8.W))
})
io.y := io.a + io.b
}
class Register extends Module {
val io = IO(new Bundle {
val d = Input(UInt (8.W))
val q = Output(UInt (8.W))
})
val reg = RegInit (0.U)
reg := io.d
io.q := reg
}
class Count10 extends Module {
val io = IO(new Bundle {
val dout = Output(UInt (8.W))
})
val add = Module(new Adder ())
val reg = Module(new Register ())
// the register output
val count = reg.io.q // G01
// connect the adder
add.io.a := 1.U
add.io.b := count
val result = add.io.y
// connect the Mux and the register input
val next = Mux(count === 9.U, 0.U, result) // G02
reg.io.d := next // G03
io.dout := count
}
```
We connect `1.U` and `count` to the inputs of the adder component. The adder's output is named `result`. The multiplexer selects between `0.U` and `result` based on the current counter value (`count`). The output of the multiplexer, named `next`, is connected to the input of the register component. Finally, we connect the counter value (`count`) to the output of the `Count10` module (`io.dout`).
Complete the above to make the design work.
G01 = `reg.io.q`
G02 = `Mux(count === 9.U, 0.U, result)`
G03 = `reg.io.d := next`
A key component in circuits that perform computation, such as a microprocessor, is the Arithmetic Logic Unit (ALU). The figure below depicts the symbol of an ALU.
![ALU](https://hackmd.io/_uploads/S1ofqmKMyg.png)
The ALU has two data inputs, labeled `a` and `b`, one function input labeled `fn`, and an output labeled `y`. The ALU processes inputs `a` and `b` and produces the result at the output `y`. The `fn` input determines which operation is applied to `a` and `b`. Typical operations include arithmetic functions like addition and subtraction, as well as logical functions like AND, OR, and XOR, hence the name `ALU`.
The ALU is typically a combinational circuit with no state elements. It may also include additional outputs to indicate specific properties of the result, such as whether the result is zero or its sign.
The following code defines an ALU with 16-bit inputs and outputs, supporting addition, subtraction, OR, and AND operations, selected by a 2-bit `fn` signal.
```scala
import chisel3.util._
class Alu extends Module {
val io = IO(new Bundle {
val a = Input(UInt (16.W))
val b = Input(UInt (16.W))
val fn = Input(UInt (2.W))
val y = Output(UInt (16.W)) // G04
})
// some default value is needed
io.y := 0.U
// The ALU selection
switch(io.fn) {
is (0.U) { io.y := io.a + io.b }
is (1.U) { io.y := io.a - io.b }
is (2.U) { io.y := io.a | io.b }
is (3.U) { io.y := io.a & io.b } // G05
}
}
```
G04 = `val y = Output(UInt (16.W))`
G05 = `io.y := io.a & io.b`
---
## Problem `H`
Examine the following RISC-V assembly code for the floating-point addition function (`fadd`), designed for single-precision floating-point addition using custom bitwise operations. Your task is to complete the code so that it functions as intended. This version includes simplified rounding logic, which may need enhancements for full IEEE 754 compliance.
```c
fadd:
# Unpack exponent of X
srl a2, a0, 23 # A2 = A0 >> 23
andi a2, a2, 0xFF # A2 = A2 & 0xFF (Exponent of X)
# Unpack exponent of Y
srl a3, a1, 23 # A3 = A1 >> 23
andi a3, a3, 0xFF # A3 = A3 & 0xFF (Exponent of Y)
# Flush-to-zero: if X exponent is zero, return Y
beqz a2, __addsf_return_y_flushed
# If Y exponent is zero, return X
beqz a3, __addsf_return_x
# Unpack significands and prepare for alignment
slli a4, a0, 9 # A4 = A0 << 9
slli a5, a1, 9 # A5 = A1 << 9
# Check for NaN or Infinity
li t0, 255
beq a2, t0, __addsf_x_nan_inf
beq a3, t0, __addsf_y_nan_inf
# Finish unpacking significands
srli a4, a4, 6 # A4 = A4 >> 6
srli a5, a5, 6 # A5 = A5 >> 6
# Restore the implicit '1' bit
li t1, 1 << (23 + 3) # t1 = 1 << 26
or a4, a4, t1 # A4 |= t1
or a5, a5, t1 # A5 |= t1
# Negate significand of X if sign bit is set
srl t2, a0, 31 # t2 = A0 >> 31, H01
andi t2, t2, 1
beq t2, x0, skip_neg_x
neg a4, a4 # A4 = -A4
skip_neg_x:
# Store 25 in t1
li t1, 25
# Negate significand of Y if sign bit is set
srl t2, a1, 31 # t2 = A1 >> 31, H02
andi t2, t2, 1
beq t2, x0, skip_neg_y
neg a5, a5 # A5 = -A5
skip_neg_y:
# Compare exponents
bltu a2, a3, __addsf_ye_gt_xe # If A2 < A3, swap operands
# Proceed with addition where exponent of X >= exponent of Y
__addsf_xe_gte_ye:
# Compute exponent difference
sub t3, a2, a3 # t3 = exp_lhs - exp_rhs
# Check if exponent difference is too large
bgeu t3, t1, __addsf_return_x # If difference >= 25, return X
# Shift significand of Y (rhs) right by exponent difference
sra a5, a5, t3 # A5 = A5 >> t3
# Set sticky bit if bits were shifted out
sll t4, a5, t3 # t4 = A5 << t3
sltu t4, t4, a5 # t4 = (t4 < A5) ? 1 : 0
or a5, a5, t4 # A5 |= t4
# Add significands
add a4, a4, a5 # A4 = A4 + A5
# Check for zero result
beqz a4, __addsf_return_0
# Convert from two's complement to sign-magnitude
srai t2, a4, 31 # t2 = Sign of result
xor a4, a4, t2 # If negative, invert bits
sub a4, a4, t2 # If negative, add 1
# Normalize the result
li t5, 0 # t5 will count leading zeros
norm_loop_xe_gte_ye:
sll t4, a4, 1 # t4 = A4 << 1
bltz t4, norm_done_xe_gte_ye # If MSB is 1, done
addi t5, t5, 1 # Increment leading zero count
sll a4, a4, 1 # Shift A4 left
j norm_loop_xe_gte_ye
norm_done_xe_gte_ye:
sub a2, a2, t5 # Adjust exponent
# Check for underflow or overflow
bgeu a2, t0, overflow_xe_gte_ye # If exponent >= 255, overflow, H03
bltz a2, underflow_xe_gte_ye # If exponent < 0, underflow
# Pack the result
slli a2, a2, 23 # Shift exponent to position
srli a4, a4, 9 # Align significand, H04
or a0, a2, a4 # Combine exponent and significand
# Restore sign
sll t2, t2, 31 # Shift sign bit back, H05
or a0, a0, t2 # Set sign bit
# Return
jalr x0, ra, 0
# Handle underflow
underflow_xe_gte_ye:
add a0, x0, x0 # Return zero
jalr x0, ra, 0
# Handle overflow
overflow_xe_gte_ye:
li a0, 0x7F800000 # Set to Infinity, H06
sll t2, t2, 31 # Shift sign bit back, H07
or a0, a0, t2 # Set sign bit
jalr x0, ra, 0
# Case where exponent of Y > exponent of X
__addsf_ye_gt_xe:
# Swap operands and repeat the process
mv t6, a0 # Temporarily store A0
mv a0, a1 # A0 = A1
mv a1, t6 # A1 = original A0
mv t6, a2 # Swap exponents
mv a2, a3
mv a3, t6
mv t6, a4 # Swap significands
mv a4, a5
mv a5, t6
j __addsf_xe_gte_ye # Jump to the same handling
# Handle NaN or Infinity for X
__addsf_x_nan_inf:
add a0, a0, x0 # Return X
jalr x0, ra, 0
# Handle NaN or Infinity for Y
__addsf_y_nan_inf:
add a0, a1, x0 # Return Y
jalr x0, ra, 0
# Return Y if X is zero
__addsf_return_y_flushed:
add a0, a1, x0 # A0 = Y
jalr x0, ra, 0
# Return X if Y is zero
__addsf_return_x:
jalr x0, ra, 0
# Return zero result
__addsf_return_0:
add a0, x0, x0 # A0 = 0
jalr x0, ra, 0
```
H01 = `srli t2, a0, 31`
H02 = `srli t2, a1, 31`
H03 = `bgeu a2, t0, overflow_xe_gte_ye`
H04 = `srli a4, a4, 9`
H05 = `slli t2, t2, 31`
H06 = `0x7F800000`
H07 = `slli t2, t2, 31`
Consider the following implementation of converting a signed 32-bit integer to a single-precision floating-point number (IEEE 754 format) using the RISC-V RV32I assembly. The focus is on understanding the low-level operations required to perform this conversion without relying on pseudo-instructions or extended instruction sets.
```c
# Convert signed int to float
# float i2f(s32 num);
# INPUT: A0 = integer number
# OUTPUT: A0 = float number (IEEE 754 single-precision)
i2f:
# Prepare result sign in A1
srli a1, a0, 31 # A1 = (A0 >> 31), extract sign bit
slli a1, a1, 31 # A1 = A1 << 31, position sign bit at bit 31
# Get absolute value of the number
beq a1, x0, positive # If sign bit is zero, number is positive
sub a0, x0, a0 # A0 = -A0 (negate to get absolute value)
positive:
# Check if number is zero
beq a0, x0, zero_result # If A0 == 0, result is zero
# Store sign and absolute value
add a2, x0, a1 # A2 = A1 (store sign)
add a1, x0, a0 # A1 = A0 (store absolute value)
# Count leading zeros in A0 (result stored in T0)
# Initialize T0 (leading zero count) to 0
add t0, x0, x0 # T0 = 0
clz_loop:
sll t1, a1, 1 # Shift A1 left by 1 bit
addi t0, t0, 1 # T0 = T0 + 1 (increment count)
mv a1, t1 # Update A1 with shifted value
bgez a1, clz_loop # Loop while A1 >= 0 (MSB is 0)
# Adjust for possible overflow in shift
li t1, 32 # Load 32 into t1
blt t0, t1, clz_done # If t0 < 32, proceed to clz_done
li a0, 32 # If t0 >= 32, leading zeros = 32
j normalize
clz_done:
# A0 now contains the number of leading zeros (stored in T0)
add a0, x0, t0 # Move T0 to A0
normalize:
# Normalize the number (shift left by leading zeros)
sub t1, x0, a0 # t1 = -A0
sll a1, a1, t1 # A1 = A1 << (32 - leading zeros), H08
# Prepare result exponent
# Exponent = 158 - leading_zeros
li t1, 158
sub a0, t1, a0 # A0 = 158 - leading zeros
# Rounding: Add 0x80 to A1 (rounding bit at position 7)
addi a1, a1, 128 # A1 = A1 + 0x80
# Check for overflow after rounding
bgez a1, rounding_ok # If A1 >= 0, no overflow
addi a0, a0, 1 # Exponent increment due to rounding overflow
j adjust_mantissa
rounding_ok:
# Check if lowest 8 bits are zero (for rounding to even)
andi a3, a1, 0xFF # A3 = A1 & 0xFF (lowest 8 bits)
beq a3, x0, round_even # If lowest 8 bits are zero, round to even
adjust_mantissa:
# Remove leading hidden bit '1'
slli a1, a1, 1 # A1 = A1 << 1 (remove hidden '1')
j compose_result
round_even:
# Ensure even mantissa (clear least significant bit)
srli a1, a1, 9 # A1 = A1 >> 9
slli a1, a1, 10 # A1 = A1 << 10 (remove hidden '1'), H09
j compose_result
compose_result:
# Align mantissa and exponent
srli a1, a1, 9 # A1 = A1 >> 9 (align mantissa to bits 0..22)
slli a0, a0, 23 # A0 = A0 << 23 (align exponent to bits 23..30), H10
or a0, a0, a1 # A0 = exponent | mantissa
or a0, a0, a2 # A0 = A0 | sign bit (bit 31)
jalr x0, ra, 0 # Return from function
zero_result:
# Return zero (signed zero with correct sign)
or a0, x0, a2 # A0 = sign bit (A2)
jalr x0, ra, 0 # Return from function
```
H08 = `sll a1, a1, t1`
H09 = `slli a1, a1, 10`
H10 = `slli a0, a0, 23`