Quiz3 of Computer Architecture (2023 Fall)

Image Not Showing Possible Reasons

The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported

General Information

You are allowed to read lecture materials.
- That is, an open book exam.
We are using the honor system during this quiz, and would like you to accept the following:
1. You will not share the quiz with anyone.
2. You will not discuss the material on the quiz with anyone until after solutions are released.
Each answer has 2.5 points.
You must answer in valid numeric representation and/or English alphabet except your formal name.
Message the instructor via Facebook if you found something wrong.
Image Not Showing Possible Reasons
- The image file may be corrupted
- The server hosting the image is unavailable
- The image path is incorrect
- The image format is not supported
Learn More →
09:10 ~ 10:20AM on Oct 24, 2023

Problem `A`

Consider an algorithm for transposing a square matrix in-place by swapping rows and columns. Below, you will find the C code for this operation, and it is important to note that the matrix elements are 32-bit integers.

#include <stddef.h>
void transpose(size_t n, int *m)
{
    for (size_t i = 0; i < n; i++) {
        for (size_t j = i + 1; j < n; j++) {
            int t = m[(i * n) + j];
            m[(i * n) + j] = m[(j * n) + i];
            m[(j * n) + i] = t;
        }
    }
}

Next, we utilize vector-style instructions to implement the transposition. Below, you will find an abbreviated listing of potentially relevant vector load/store instructions.

Vector Load/Store Instructions

instruction	definition
`vsetvli rd, rs, eN, mX, tP, mP`	This instruction updates `rd` with the vector length computed; `rs` is an input register operand that contains the application vector length (AVL) which represents the vector length the program wants to use. N in eN is the sew (8, 16, 32, 64, …); X in mX is the lmul (spelled as fY for 1/Y cases); P is the policy for tail (t) and mask (m): u for undisturbed, a for agnostic
`vle32.v vd, (rs1), vm`	`vd[i] = mem[(rs1) + i*4]`
`vse32.v vs3, (rs1), vm`	`mem[(rs1) + i*4] = vs3[i]`
`vlse32.v vd, (rs1), rs2, vm`	`vd[i] = mem[(rs1) + i*rs2]`
`vsse32.v vs3, (rs1), rs2, vm`	`mem[(rs1) + i*rs2] = vs3[i]`
`vluxei32.v vd, (rs1), vs2, vm`	`vd[i] = mem[(rs1) + vs2[i]]` (unordered)
`vsuxei32.v vs3, (rs1), vs2, vm`	`mem[(rs1) + vs2[i]] = vs3[i]` (unordered)
`vloxei32.v vd, (rs1), vs2, vm`	`vd[i] = mem[(rs1) + vs2[i]]` (ordered)
`vsoxei32.v vs3, (rs1), vs2, vm`	`mem[(rs1) + vs2[i]] = vs3[i]` (ordered)

See also: RISC-V "V" Vector Extension

Complete the vector code provided below to vectorize the matrix transpose operation.

    # a0: n
    # a1: m
transpose:
    li t0, 1
    bleu a0, t0, end # skip if n <= 1
    slli t0, a0, __A01__  # initialize t0 with stride in bytes
    addi a0, a0, -1 # decrement n
loop_i:
    mv t2, a0 # number of elements to swap = n - (i+1)
    addi t3, a1, __A02__ # temporary pointer to row at m[i][i+1]
    add t4, a1, t0 # temporary pointer to column m[i+1][i]
    addi a1, t4, __A03__ # bump m pointer by (n + 1) elements
loop_j:
    vsetvli t5, t2, e32, m1, ta, ma
    vle32.v v0, (t3) # vector load row m[i][...]
    vlse32.v v1, (t4), t0 # vector load column m[...][i]
    vsse32.v v0, (t4), t0 # vector store column m[...][i]
    vse32.v v1, (t3) # vector store row m[i][...]
    sub t2, t2, t5 # decrement vl
    mul t6, t5, t0 # vl*stride in bytes
    slli t5, t5, 2 # vl in bytes
    add t3, t3, t5 # bump row pointer
    __A04__ # bump column pointer
    bnez t2, loop_j
    addi a0, a0, -1 # decrement n
    __A05__
end:
    ret

A01 = ?

A02 = ?

A03 = ?

A04 = ?

A05 = ?

Problem `B`

Consider REV as a macro-operation fusion instruction that reverses a linked list in memory. The rs1 operand for this instruction is the memory address of a pointer to the first node in the linked list, which is essentially a pointer to a pointer.

REV rs1

Each node in the linked list has the following structure. In this architecture, assume that pointers are 32 bits wide. The 'next' pointer can either hold the memory address of the next node in the list or be set to 0 (NULL) to indicate the end of the linked list.

struct node {
    struct node *next;
    void *value;
}

To provide a point of reference, we have included both the C and assembly code equivalents for this instruction below.

void REV(struct node **head)
{
    struct node *prev = NULL;
    struct node *curr = *head;
    while (curr) {
        struct node *next = curr->next;
        curr->next = prev;
        prev = curr;
        curr = next;
    }
}

    # head is passed in a0
    # t0 holds prev
    # t1 holds next
    lw a0, 0(a0)
    beqz a0, done
    addi t0, t0, 0
loop:
    lw t1, 0(a0)
    sw t0, 0(a0)
    addi t0, a0, 0
    addi a0, t1, 0
    bnez t1, loop
done:

Let's analyze the execution of the assembly code for linked list reversal on an unpipelined RISC-V core. In this core, each instruction has a CPI (Cycles Per Instruction) of 1, except for load and store instructions, which require 2 cycles each. How many instructions are needed to reverse a linked list with a length of 4 using this program? (Fill in the numbers)

Prologue: __ B01 __ load, __ B02 __ store , __ B03 __ other instructions
Loop: __ B04 __ load, __ B05 __ store, __ B06 __ other instructions

B01 = ?

B02 = ?

B03 = ?

B04 = ?

B05 = ?

B06 = ?

Problem C

A garage door will initiate its opening sequence when it detects the password '011' in a transmission. To put it formally, this Finite State Machine (FSM) accepts a bitstring composed of 0's and 1's as input and produces a continuous stream of 0's until it encounters the substring '011,' at which point it begins generating a continuous stream of 1's. Below are examples of how this FSM operates:

Input: 010101001100001010101
- Output: 000000000111111111111
Input: 000000000001000000010
- Output: 000000000000000000000

Image Not Showing Possible Reasons

The image was uploaded to a note which you don't have access to
The note which the image was originally uploaded to has been deleted

Learn More →

Mark the correct FSM state transition for each numbered arrow.

Arrow 5

__ C01 __
Arrow 6

__ C02 __
Arrow 7

__ C03 __
Arrow 8

__ C04 __

C01 = ?

C02 = ?

C03 = ?

C04 = ?

Problem D

Take a look at the circuit diagram below along with the delays of its components:

t_multiplier = 50ps
t_subtract = 35ps
t_XOR = 25ps
t_AND = 20ps
t_OR = 20ps
t_NOT = 5ps
t_clk-q = 5ps
t_setup = 2ps

What is the smallest combinational delay of all paths in this circuit, in picoseconds? __ D01 __
- D01 = ?
What is the maximum hold time the registers can have so that there are no hold time violations in the circuit above? __ D02 __
- D02 = ?
What is the minimum allowable clock period for this circuit to function properly?
- D03 = ? __ D03 __

Problem E

Decode the following RISC-V instruction. You can use online RISC-V Instruction Encoder/Decoder.

0x24A009EF

which is mapped to

jal __E01__, __E02__

E01 should start with x.

E01 = ?

E02 = ?

Problem F

Examine the circuit diagram presented below. SEL is a single-bit control signal that undergoes instantaneous updates at the rising edge of each clock cycle, maintaining stability throughout each clock cycle. You can make the assumption that Input will not result in any hold time violations.

t_multiplier = 1000 ps
t_clk-q = 30 ps
t_mux = 25 ps
t_setup = 20 ps
t_NOT = 8 ps
t_shifter = 2 ps

The combinational logic block on the left performs a left shift operation on the top input by the number of bits specified by the bottom input. In the diagram, the shifter connected to the register on the left shifts its output to the left by 1 bit.

Suppose we want to maximize the circuit's frequency without altering its behavior, and we are not allowed to modify the delays of any component.

The most significantly slow component in this context is the multiplier, and addressing this issue is necessary. Specifically, if our sole multiplication requirement is doubling numbers, we do not require a versatile multiplier capable of multiplying any two inputs together.

What is the minimum clock period after implementing the above change? __ F01 __

F01 = ?

Problem G

The function add_even is specified as follows:
Input:

a0: Pointer to the beginning of an array containing 32-bit integers.
a1: The number of integers in the array.
Output:
a0: The sum of all even numbers within the array.

Example: Let's consider an input where a0 points to

[4, 5, 6, 7]

, and a1 holds the value 4. In this case, the output in a0 will be 10 (4 + 6).

Complete the missing sections in the RISC-V code provided below. Each line should consist of exactly one instruction or pseudo-instruction.

add_even:
    addi t0, x0, 0 # set t0 to be the running sum
loop:
    beq a1, x0, end
    lw t1 0(a0) # set t1 to be the number in the array
    srli t2, t1, 1
    __G01__
    bne t1, t2, pass
    add t0, t0, t2
pass:
    __G02__
    addi a0, a0, 4
    j loop
end:
    mv a0, t0

G01 = ?

G02 = ?

Problem H

A Hamming Error Correcting Code is capable of rectifying single-bit errors. In scenarios where data is susceptible to errors, such as in satellite communication or L1 caches, it's advantageous to employ Hamming codes to store data blocks. These codes can internally correct errors during memory retrieval.

In this context, we will examine a (7,4) Hamming Code with even parity. This code utilizes 7 bits in total to store 4 data bits. Please refer to the following bit pattern for the (7,4) Hamming Code:

We follow the convention where the first bit is considered the most significant in the data, and the seventh bit is the least significant. Then, we create a circuit that, when provided with a (7,4) Hamming Code, produces a corrected nibble of data. What components should be placed in the labeled boxes? Assume that multiple-input gates are created by connecting two-input gates.

It is known that both Box I and Box II belong to this category. Please provide answers accordingly.

AND gate
OR gate
XOR gate
Adder gate
Multiplier gate
Multiplexer (with 0 input on top)
Demultiplexer (with 0 input on top)
Priority Encoder (with 0 input on top)

What should be placed inside Box I? __ H01 __

H01 = ?

What should be placed inside Box II? __ H02 __

H02 = ?

Given that the component in Box I has a 3 ns delay, the component in Box II has a 5 ns delay, and the computation of labels A-H takes 1 ns, if the inputs are received at time 0, what are the earliest and latest times at which the output can change in response?
- Earliest: __ H03 __ ns
- Latest: __ H04 __ ns
- H03 = ?
- H04 = ?

Problem I

The function verify_password is defined as follows:

Input: There are no register inputs; however, the function receives a string input from stdin.
Output: a0 returns 1 if the input from stdin matches "secretpass," and 0 otherwise.

You have access to the following externally defined labels:

password: This is a pointer to a statically stored string "secretpass."
getchars: This function is defined as follows:
- Input: a0 is a pointer to a buffer.
  
  Effect: It reads characters from stdin and populates the buffer pointed to by a0 with the read data, null-terminating the string. Your code can assume that the input consists of a maximum of 19 characters, not including the null-terminator.
- Output: None

In your current role as a hacker, your objective is to target the implementation of verify_password.

verify_password:
    addi sp, sp, -24 # Make space for a 20-byte buffer
    sw ra, 20(sp)
    mv a0, sp
    jal ra, getchars
    la t0, password
    mv t1, sp
Loop:
    lb t2, 0(t0)
    lb t3, 0(t1)
    bne t2, t3, Fail
    beq t2, x0, Pass
    addi t0, t0 1
    addi t1, t1 1
    j Loop
Pass:
    li a0, 1
    j End
Fail:
    li a0, 0
End:
    lw ra, 20(sp)
    addi sp, sp, 24
    jr ra

During your testing, you have uncovered an intriguing revelation: the function getchars does not function as originally intended. Instead of truncating at the 20th character, getchar continues writing data until it encounters the first null terminator within its input. Just as before, verify_password resides at memory address 0x1000, and getchars is situated at address 0x0F00. Additionally, assume that the stack pointer is initially positioned at 0xBFFF F800 when verify_password begins execution. Our page size is 4 KiB, and we are currently operating on a little-endian system.

Given the ability to relocate the program counter to any desired location, we intend to execute a jump to our custom RISC-V code. To accomplish this, our plan is to initiate a jump to the beginning of the buffer stored on the stack. What is the maximum number of RISC-V standard instructions we can inject into this buffer? __ I01 __

I01 = ?

Setting aside your previous response, let's assume we can insert a maximum of 5 instructions into the buffer. However, this limitation doesn't provide us with sufficient instructions to achieve our goals effectively. Consequently, we opt to introduce code that grants us the ability to execute more extensive programs, rather than being confined to only 5 instructions. Please fill in the following 5-line code, ensuring that you can employ pseudo-instructions, as long as they ultimately resolve to exactly one instruction.

# Allocate a buffer of 256 bytes, which does not overlap with any data
# we already are using (such as the instructions injected in part 1)
1: addi sp, sp, __I02__

# Set the argument of getchars to the start of the allocated buffer
2: mv __I03__, sp

# Set t1 so the next instruction jumps to getchars
3: lui t1, __I04__
4: jalr ra t1 -256 # Call getchars

# Jump to the start of the buffer
5. jr __I05__

I02 = ?

I03 = ?

I04 = ?

I05 = ?

Among the following jump instructions, which ones are suitable for use in our injected code? There's no need to be concerned about these lines not correctly calling getchars; our goal is to have a valid RISC-V jump instruction. Additionally, please take note that +3840 is equivalent to 0x00000F00. Select all possible choices. __ I06 __
a. ret
b. jalr x0, t1, -256
c. jalr s0, t1, -256
d. jalr ra, t2, -256
e. jalr ra, t0, 16
f. jalr s0, x0, 3840
g. None of the above
- I06 = ?

Problem J

Examine the provided RISC-V code snippet:

Loop:
    andi t2, t1, 1
    srli t3, t1, 1
    bltu t1, a0, Loop
    jalr s0, s1, MAX_POS_IMM
    ...

What is the value of the byte offset that would be stored in the immediate field of the bltu instruction?

J01 = ?

We propose a revision to the standard 32-bit RISC-V instruction formats. In this revised format, each instruction will have a unique opcode, still represented using 7 bits. As a result of this change, we will eliminate the funct3 field from the R, I, S, and SB instructions, allowing us to allocate bits to other instruction fields, while retaining the opcode field. Assuming that s0 = 0x10000000, s1 = 0x40000000, and PC = 0xA0000000, let's examine the instruction: jalr s0, s1, MAX_POS_IMM, where MAX_POS_IMM represents the maximum possible positive immediate value for jalr. After this instruction executes, what will be the values in the following registers? (Provide your answers in hexadecimal.)
- s0 = J02
- s1 = J03
- PC = J04

J02 = ?

J03 = ?

J04 = ?

Problem `K`

Let's examine the program that recursively calculates the Fibonacci sequence. On the left, you will find the C code, and on the right, you can see its corresponding translation into RISC-V assembly. It's important to note that the execution has been paused just before executing the 'ret' instruction. The 'SP' label on the stack frame (part 3) indicates the location where the stack pointer is pointing when the execution was halted.

C code

int fib(int n)
{
    if (n <= 1)
        return n;
    return fib(n - 1) + fib(n - 2);
}

RISC-V Assembly (incomplete)

fib:
     addi sp, sp, -12
     sw ra, 0(sp)
     sw a0, 4(sp)
     sw s0, 8(sp)
     li s0, 0
     li a7, 1
if:  
     ble __K01__
L2: 
     addi a0, a0, -1
     call fib
     add s0, s0, a0
     lw a0, 4(sp)
     addi a0, a0, -2
     call fib
     mv t0, a0
     add a0, s0, t0
done:
     lw ra, 0(sp)
     lw s0, 8(sp)
L1:
     addi sp, sp, 12
      ret

Complete the missing portion of the ble instruction to make the assembly implementation match the C code.
- K01 = ?
How many unique words will be reserved and pushed onto the stack each time the fib function is called?
- K02 = ?
Kindly provide the values to replace the blank spaces in the stack trace below. Present the values in hexadecimal format.

Notation address

Smaller address 0x280

0x1

K03

SP
$\to$ K04

K05

0x0

0x280

0x3

0x0

0x2108

0x4

0x6

Larger address 0x1
- K03 = ?
- K04 = ?
- K05 = ?
What is the hex address of the done label? (Answer in hexadecimal format)
- K06 = ?
What was the address of the original function call to fib? (Answer in hexadecimal format)
- K07 = ?

Notation	address
Smaller address	0x280
	0x1
	K03
SP $\to$	K04
	K05
	0x0
	0x280
	0x3
	0x0
	0x2108
	0x4
	0x6
Larger address	0x1