Consider an algorithm for transposing a square matrixin-place by swapping rows and columns. Below, you will find the C code for this operation, and it is important to note that the matrix elements are 32-bit integers.
#include <stddef.h>
void transpose(size_t n, int *m)
{
for (size_t i = 0; i < n; i++) {
for (size_t j = i + 1; j < n; j++) {
int t = m[(i * n) + j];
m[(i * n) + j] = m[(j * n) + i];
m[(j * n) + i] = t;
}
}
}
Next, we utilize vector-style instructions to implement the transposition. Below, you will find an abbreviated listing of potentially relevant vector load/store instructions.
Vector Load/Store Instructions
instruction | definition |
---|---|
vsetvli rd, rs, eN, mX, tP, mP | This instruction updates rd with the vector length computed; rs is an input register operand that contains the application vector length (AVL) which represents the vector length the program wants to use.N in eN is the sew (8, 16, 32, 64, โฆ); X in mX is the lmul (spelled as fY for 1/Y cases); P is the policy for tail (t) and mask (m): u for undisturbed, a for agnostic |
vle32.v vd, (rs1), vm | vd[i] = mem[(rs1) + i*4] |
vse32.v vs3, (rs1), vm | mem[(rs1) + i*4] = vs3[i] |
vlse32.v vd, (rs1), rs2, vm | vd[i] = mem[(rs1) + i*rs2] |
vsse32.v vs3, (rs1), rs2, vm | mem[(rs1) + i*rs2] = vs3[i] |
vluxei32.v vd, (rs1), vs2, vm | vd[i] = mem[(rs1) + vs2[i]] (unordered) |
vsuxei32.v vs3, (rs1), vs2, vm | mem[(rs1) + vs2[i]] = vs3[i] (unordered) |
vloxei32.v vd, (rs1), vs2, vm | vd[i] = mem[(rs1) + vs2[i]] (ordered) |
vsoxei32.v vs3, (rs1), vs2, vm | mem[(rs1) + vs2[i]] = vs3[i] (ordered) |
Complete the vector code provided below to vectorize the matrix transpose operation.
# a0: n
# a1: m
transpose:
li t0, 1
bleu a0, t0, end # skip if n <= 1
slli t0, a0, __A01__ # initialize t0 with stride in bytes
addi a0, a0, -1 # decrement n
loop_i:
mv t2, a0 # number of elements to swap = n - (i+1)
addi t3, a1, __A02__ # temporary pointer to row at m[i][i+1]
add t4, a1, t0 # temporary pointer to column m[i+1][i]
addi a1, t4, __A03__ # bump m pointer by (n + 1) elements
loop_j:
vsetvli t5, t2, e32, m1, ta, ma
vle32.v v0, (t3) # vector load row m[i][...]
vlse32.v v1, (t4), t0 # vector load column m[...][i]
vsse32.v v0, (t4), t0 # vector store column m[...][i]
vse32.v v1, (t3) # vector store row m[i][...]
sub t2, t2, t5 # decrement vl
mul t6, t5, t0 # vl*stride in bytes
slli t5, t5, 2 # vl in bytes
add t3, t3, t5 # bump row pointer
__A04__ # bump column pointer
bnez t2, loop_j
addi a0, a0, -1 # decrement n
__A05__
end:
ret
- A01 =?
2- A02=?
4- A03=?
4- A04=?
add t4, t4, t6- A05=?
bnez a0, loop_i
For A01, a0*4 is used to jump to the next element in the same column, effectively shifting the memory location by two positions to the left to ensure the correct memory address.
For A02, this is because we need to calculate a temporary pointer pointing to the next element in the current row. Since each element is 4 bytes, we increment pointer a1 (pointing to the beginning of the current row) by 4.
For A03, it should be "addi t4, a1, 0". Here, we set the temporary pointer t4 to point to the beginning of the current row pointed to by a1, as t4 will be used for vector storage instructions, which require pointing to the memory location where we want to store data.
For A04, this instruction is used to update the column pointer after each iteration of the loop.
For A05, this is an instruction used to determine whether transposition is still needed. If a0 is not equal to 0, the transposition continues
Consider REV as a macro-operation fusion instruction that reverses a linked list in memory. The rs1 operand for this instruction is the memory address of a pointer to the first node in the linked list, which is essentially a pointer to a pointer.
REV rs1
Each node in the linked list has the following structure. In this architecture, assume that pointers are 32 bits wide. The โnextโ pointer can either hold the memory address of the next node in the list or be set to 0 (NULL) to indicate the end of the linked list.
struct node {
struct node *next;
void *value;
}
To provide a point of reference, we have included both the C and assembly code equivalents for this instruction below.
void REVLL(struct node **head)
{
struct node *prev = NULL;
struct node *curr = *head;
while (curr) {
struct node *next = curr->next;
curr->next = prev;
prev = curr;
curr = next;
}
}
# head is passed in a0
# t0 holds prev
# t1 holds next
lw a0, 0(a0)
beqz a0, done
addi t0, t0, 0
loop:
lw t1, 0(a0)
sw t0, 0(a0)
addi t0, a0, 0
addi a0, t1, 0
bnez t1, loop
done:
Letโs analyze the execution of the assembly code for linked list reversal on an unpipelined RISC-V core. In this core, each instruction has a CPI (Cycles Per Instruction) of 1, except for load and store instructions, which require 2 cycles each. How many instructions are needed to reverse a linked list with a length of 4 using this program? (Fill in the numbers)
- B01=?
1- B02=?
0- B03=?
2- B04=?
1- B05=?
1- B06=?
3
For the Prologue, there is only one load/store instruction, and the rest of the instructions are paired in twos. As for the Loop, it includes one load instruction, one store instruction, and three other instructions. Therefore, the unpipelined CPI calculation is as follows:
Hence, in this reverse link list assembly code, a total of 32 cycles occur.
A garage door will initiate its opening sequence when it detects the password โ011โ in a transmission
To put it formally, thisFinite-state machine(FSM) accepts a bitstring composed of 0โs and 1โs as input and produces a continuous stream of 0โs until it encounters the substring โ011,โ at which point it begins generating a continuous stream of 1โs. Below are examples of how this FSM operates:
Output: 000000000111111111111
Output: 000000000000000000000
Mark the correct FSM state transition for each numbered arrow.
__ C01 __
__ C02 __
__ C03 __
__ C04 __
- C01=?
0/0- C02=?
1/1- C03=?
1/1- C04=?
0/1
For this FSM, the state transitions are as follows:
For Arrow 5, when the FSM is in State 1 and the input is 0, it should remain in State 1 and output 0 because we haven't detected any part of the "011" pattern yet. So, the transition for C02 should be 0/0.
For Arrow 6, when the FSM is in State 5 and the input is 1, we have successfully matched "01" and are waiting for another 1 to complete the "011" pattern. Therefore, we move to State 6 and output 0 because we haven't completed the entire pattern yet. So, the transition for C02 should be 1/0.
For Arrow 7, it means we have matched the entire "011" pattern. Hence, we transition to State 7 and start outputting 1. So, the transition for C03 should be 1/1.
For Arrow 8, regardless of the input, we have already matched the "011" pattern, so we should stay in State 7 and continue outputting 1. Therefore, the transition for C04 should be 0/1.
Take a look at the circuit diagram below along with the delays of its components:
1.What is the smallest combinational delay of all paths in this circuit, in picoseconds? __ D01 __
- D01=?
25ps
The shortest CL path is between the right register and the top left register, consisting of a NOT gate and an OR gate, for a total delay of 25ps.
2.What is the maximum hold time the registers can have so that there are no hold time violations in the circuit above?
- D02=?
30ps
The shortest CL path is 25ps (see Q6.1), and the maximum hold time is the shortest CL path + clk-to-q, which gives us 30ps.
3.What is the minimum allowable clock period for this circuit to function properly?
- D03=?
117ps
The longest path between any two registers is between the top left register and the register, consisting of a multiplier, a NOT gate, a subtractor, and an AND gate, for a total of 110ps.
Additionally, we need to account for clk-to-q and setup, which gives us 117ps.
For D01, the shortest path is from the right-hand register to the upper-left register, with a calculated delay of
For D02, Maximum Hold Time refers to the duration during which data must remain stable after the clock edge to ensure proper sampling. Maximum Hold Time indicates how long the data can change after the clock edge without violating the register's hold time requirements. Therefore, its calculation is
For D03, the minimum allowable clock period is the shortest clock cycle required for the circuit to operate correctly. It is determined by adding the longest combinational logic path delay to the setup time of the registers. The calculation is as follows:
The pathway diagram is shown above.
Decode the following RISC-V instruction. You can use online RISC-V Instruction Encoder/Decoder.
0x24A009EF
which is mapped to
jal __E01__, __E02__
E01 should start with x.
- E01 =?
x19- E02 =?
586
The online RISC-V instruction to hexadecimal memory address conversion tool allows us to gain a better understanding of the behavior of instructions in the ID (Instruction Decode) stage.
For example, jal
is a U-type instruction, and its general format is jal rd, offset
. The core instruction format for jal
is as follows:
Imm[31:12] | rd | opcode |
---|
The decoding steps are as follows to convert the hexadecimal instruction into binary format:
According to the RISC-V instruction format, decompose the binary instruction into different fields.
Extract the destination register (rd) and the offset from these fields.
Examine the circuit diagram presented below. SEL is a single-bit control signal that undergoes instantaneous updates at the rising edge of each clock cycle, maintaining stability throughout each clock cycle. You can make the assumption that Input will not result in any hold time violations.
The combinational logic block on the left performs a left shift operation on the top input by the number of bits specified by the bottom input. In the diagram, the shifter connected to the register on the left shifts its output to the left by 1 bit.
Suppose we want to maximize the circuitโs frequency without altering its behavior, and we are not allowed to modify the delays of any component.
The most significantly slow component in this context is the multiplier, and addressing this issue is necessary. Specifically, if our sole multiplication requirement is doubling numbers, we do not require a versatile multiplier capable of multiplying any two inputs together.
What is the minimum clock period after implementing the above change? __ F01 __
- F01=?
77 ps
With the multiplier replaced by a shifter, the new longest path starts from either left-side register, goes through the shifter and the mux, and ends at the rightmost register.
From the rising edge, we wait 30 ps for the output to appear out of the left registers (clk-to-q).
Then we wait another 2 ps for the signal to travel through the shifter, and another 25 ps for the multiplexer. Finally, we have to get to the rightmost register 20 ps before the next rising edge to account for setup time. In total, this is 30 + 2 + 25 + 20 = 77 ps.
After modifying the circuit, the longest path now includes a shifter and a multiplexer. Therefore, the longest path delay is calculated as
Here is the pathway diagram:
The function add_even
is specified as follows:
Input:
10
(4+6)Complete the missing sections in the RISC-V code provided below. Each line should consist of exactly one instruction or pseudo-instruction.
add_even:
addi t0, x0, 0 # set t0 to be the running sum
loop:
beq a1, x0, end
lw t1 0(a0) # set t1 to be the number in the array
srli t2, t1, 1
__G01__
bne t1, t2, pass
add t0, t0, t2
pass:
__G02__
addi a0, a0, 4
j loop
end:
mv a0, t0
- G01=?
slli t2, t2, 1- G02=?
addi a1, a1, -1
slli t2, t2, 1
is used to ensure that only even numbers are added to the sum, while addi a1, a1, -1
ensures that the loop will terminate after processing all array elements
Write a simple test program for this function
.data
numbers: .word 2, 5, 8, 10, 13
size: .word 5
.text
.globl main
main:
la a0, numbers
lw a1, size
call add_even
mv a2, a0
call print
li a7, 10
ecall
add_even:
addi t0, x0, 0 # set t0 to be the running sum
loop:
beq a1, x0, end
lw t1 0(a0) # set t1 to be the number in the array
srli t2, t1, 1
slli t2, t2, 1
bne t1, t2, pass
add t0, t0, t2
pass:
addi a1,a1,-1
addi a0, a0, 4
j loop
end:
mv a0, t0
print:
li a7,1
ecall
ret
We obtain an output of
A Hamming Error Correcting Codeis capable of rectifying single-bit errors. In scenarios where data is susceptible to errors, such as in satellite communication or L1 caches, itโs advantageous to employ Hamming codes to store data blocks. These codes can internally correct errors during memory retrieval.
In this context, we will examine a (7,4) Hamming Code with even parity. This code utilizes 7 bits in total to store 4 data bits. Please refer to the following bit pattern for the (7,4) Hamming Code:
We follow the convention where the first bit is considered the most significant in the data, and the seventh bit is the least significant. Then, we create a circuit that, when provided with a (7,4) Hamming Code, produces a corrected nibble of data. What components should be placed in the labeled boxes? Assume that multiple-input gates are created by connecting two-input gates.
It is known that both Box I and Box II belong to this category. Please provide answers accordingly.
- H01=?
XOR gate
Cascaded XOR gates can be used to determine the parity of the input bits; XOR returns 1 if the number of input โ1โ bits is odd, and 0 if the number of input 1 bits is even. This matches our definition of parity.
- H02=?
Multiplexer (with 0 input on top)
We need to output the corrected data based on the parity bits; this can only be done using a multiplexer.
- H03=?
6 ns.
The values of A-H are computed at 1 ns. Then after the 5 ns delay of Box 2, the output changes.- H04=?
8 ns.
The inputs change at 0 ns. Then after the 3 ns delay of Box I, the bottom input to Box II changes. Then after the 5 ns delay of Box II, the output changes.
An XOR gate is used to calculate parity bits in this system. In Hamming code, parity bits are employed to detect and correct single-bit errors. By arranging XOR gates according to the rules of Hamming code, we can identify which bit has an error and correct it. So, H01 represents an XOR gate.
In this system, a multiplexer is used to select the correct data output. If the Hamming code indicates an error, the multiplexer will choose to flip the bit corresponding to the error's position to correct it. Connecting 0 to the top input of the multiplexer is done to keep the data bits unchanged in the absence of errors. Therefore, Box2 represents a Multiplexer.
for Earliest time is
for Latest time is
When the input changes at time 0, Box I takes 3 ns to process the signal. Then, the signal enters Box II, which takes 5 ns to process the signal. Therefore, the output signal will change after the combination of these two delays: 3 ns (Box I) + 5 ns (Box II) = 8 ns
The function verify_password is defined as follows:
Input: a0 is a pointer to a buffer.
Effect: It reads characters from stdin and populates the buffer pointed to by a0 with the read data, null-terminating the string. Your code can assume that the input consists of a maximum of 19 characters, not including the null-terminator.
Output: None
In your current role as a hacker, your objective is to target the implementation of verify_password.
verify_password:
addi sp, sp, -24 # Make space for a 20-byte buffer
sw ra, 20(sp)
mv a0, sp
jal ra, getchars
la t0, password
mv t1, sp
Loop:
lb t2, 0(t0)
lb t3, 0(t1)
bne t2, t3, Fail
beq t2, x0, Pass
addi t0, t0 1
addi t1, t1 1
j Loop
Pass:
li a0, 1
j End
Fail:
li a0, 0
End:
lw ra, 20(sp)
addi sp, sp, 24
jr ra
During your testing, you have uncovered an intriguing revelation: the function getchars does not function as originally intended. Instead of truncating at the 20th character, getchar continues writing data until it encounters the first null terminator within its input. Just as before, verify_password resides at memory address 0x1000, and getchars is situated at address 0x0F00. Additionally, assume that the stack pointer is initially positioned at 0xBFFF F800 when verify_password begins execution. Our page size is 4 KiB, and we are currently operating on a little-endian system.
- I01=?
5
The buffer is 20 bytes long. Each RISC-V instruction is 4 bytes long. In total, we can fit 20 / 4 = 5 instructions in the buffer.
# Allocate a buffer of 256 bytes, which does not overlap with any data
# we already are using (such as the instructions injected in part 1)
1: addi sp, sp, __I02__
# Set the argument of getchars to the start of the allocated buffer
2: mv __I03__, sp
# Set t1 so the next instruction jumps to getchars
3: lui t1, __I04__
4: jalr ra t1 -256 # Call getchars
# Jump to the start of the buffer
5. jr __I05__
- I02=?
-280.
We allocate 256 bytes of buffer, plus 24 bytes to avoid overlap with the injected instructions and the saved ra register from part 1, for a total of 280 bytes. (This does mean that we rely on data after our stack pointer to stay constant, which is technically undefined behavior. In reality, the data will stay consistent for long enough that
we move the stack pointer back down; since our sp starts in the middle of a page, the data will not be deallocated.- I03=?
a0
The getchars function expects the address of a buffer as an argument in the a0 register.- I04=?
1
This puts the value 0x1000 in t1, which then causes the next line to jump to 0x1000 - 256 = 0x0F00, the address of Get20Chars. Note that we needed to use lui here, since addi can only increase by up to 0x7FF. Line 4 could also not have been a jal operation, since our jump distance would have exceeded the size of a 20-bit immediate.- I05=?
sp
We want to jump to the address in the sp register, and we do not care about saving a return address anywhere.
a. ret
b. jalr x0, t1, -256
c. jalr s0, t1, -256
d. jalr ra, t2, -256
e. jalr ra, t0, 16
f. jalr s0, x0, 3840
g. None of the above
- I06 = ?
d, e
ret = jr ra = jalr x0 ra 0 and jalr x0, t1 -256 contain a null byte when translated to machine code: 0x00008067 and 0xF0030067.
jalr s0 x0 3840 is an invalid instruction because the immediate 3840 is greater than 2047.
I-type immediates must be between -2048 and +2047.
jalr s0 t1 -256 and jalr ra t0 16 are valid, and do not have a null byte; note that jalr s0, t1, -256 = 0xF0030467 does not have a null byte, because it gets split up into bytes as 0x67 0x04 0x03 0xF0.
FOR I02
To allocate a new buffer that won't overlap with the already injected instructions, we allocated an additional 256 bytes in addition to the original 20 bytes, plus an extra 24 bytes reserved to avoid overlap. So, in total, we need to move the stack pointer down by 280 bytes.
For I03
The getchars function expects to receive the address of a buffer in the
a0
register, so here, we move the value of the stack pointersp
intoa0
For I04
setting the t1 register to 0x1000 and then using the jalr instruction, we can make the program counter jump to this address. Since the address of the getchars function is 0xF00, we need to subtract 256 (0x100) from 0x1000 to correctly jump to getchars
For I06
Among the given options, "ret" and "jalr x0, t1, -256" meet these criteria because they do not contain immediate values beyond the range, and when converted into machine code, they do not have null bytes. This is important for certain security vulnerability exploits
Examine the provided RISC-V code snippet:
Loop:
andi t2, t1, 1
srli t3, t1, 1
bltu t1, a0, Loop
jalr s0, s1, MAX_POS_IMM
...
- J01=?
-8
Two instructions away = -8 bytes Each instruction isbyte,so to go back to Loop, it needs to minus bytes.
- J02=?
0xA000 0004- J03=?
0x4000 0000- J04=?
0x4000 0FFF
We know that rd and rs1 fields are now 6 bits. jalr is an I-type instruction, so we take out the funct3 bits but we give each of rd and rs1 fields 1 bit, meaning we have 1 bit leftover to give to the immediate field. Thus, we now have a 13-bit immediate. Thus, the maximum possible immediate a jalr instruction can hold ishalfwords away, which is represented as 0b0 1111 1111 1111, which is 0x0FFF.
s0 is the linking register โ itss value is PC + 4
s1 does not get written into so it stays the same.
PC = R[s1] + 0x0FFF
bltu is SB-format instruction Its description is if([Rs1]<[Rs2] PC=PC+imm{imm,1b'0}
The core instruction format is as follows:
Imm[12I10:5] | rs2 | rs1 | funct3 | imm[4:1I11] | opcode |
---|
Since each instruction is 4 bytes long, and the "Loop" label is two instructions before the current instruction's position, the byte offset is -8 (-2 instructions * 4 bytes/instruction).
The jalr
instruction sets the Program Counter (PC) to the value in register s1 plus the value in the immediate field and stores the address of the next instruction (current PC
value + 4) into register s0. Because MAX_POS_IMM
is the maximum possible immediate value, the value in s0 register will be the current PC value plus 4.
This is about the value of register s1 after the "jalr" instruction. In this instruction, s1 is not modified, so the value of register s1 will remain as it was before the instruction execution, which is 0x40000000
.
Given that s1 is 0x40000000
, and the immediate value is the maximum value 0x0FFF (because the immediate has 13 bits, and the maximum value is 12 bits of 1), the new PC value will be 0x40000FFF
.
Letโs examine the program that recursively calculates the Fibonacci sequenceOn the left, you will find the C code, and on the right, you can see its corresponding translation into RISC-V assembly. Itโs important to note that the execution has been paused just before executing the โretโ instruction. The โSPโ label on the stack frame (part 3) indicates the location where the stack pointer is pointing when the execution was halted.
int fib(int n)
{
if (n <= 1)
return n;
return fib(n - 1) + fib(n - 2);
}
fib:
addi sp, sp, -12
sw ra, 0(sp)
sw a0, 4(sp)
sw s0, 8(sp)
li s0, 0
li a7, 1
if:
ble __K01__
L2:
addi a0, a0, -1
call fib
add s0, s0, a0
lw a0, 4(sp)
addi a0, a0, -2
call fib
mv t0, a0
add a0, s0, t0
done:
lw ra, 0(sp)
lw s0, 8(sp)
L1:
addi sp, sp, 12
ret
- K01=?
a0, a7, done
a0 is n in C code.
a7 is 1 in C code.
done is ready to return in C code.
We need to fill in the destination for the jump. In this function, if n is less than or equal to 1, it executes the code under the "done" label; otherwise, it continues executing the code under the "L2" label. So, K01 should be filled with "a0, a7, done," which means that if a0 is less than or equal to a7 (where a7 is initialized to 1), it will jump to the "done" label. Here, a0 represents n in C code, and a7 represents 1 in C code.
- K02=?
3
there are three sw instructions that respectively store the values of the ra, a0, and s0 registers into the stack. Each word is composed of 4 bytes, totaling 12 byte
Notation | address |
---|---|
Smaller address | 0x280 |
0x1 | |
K03 | |
SP -> | K04 |
K05 | |
0x0 | |
0x280 | |
0x3 | |
0x0 | |
0x2108 | |
0x4 | |
0x6 | |
Larger address | 0x1 |
- K03=?
0x0- K04=?
0x280- K05=?
0x2
for K03 This represents the value at the address where the Stack Pointer (SP) was before calling the fib function. Here, the value of K03 should be the value of the ra register before calling the fib function, which is 0x280, indicating the position of the Program Counter before calling the fib function.
for K04 This is the value of the a0 register after the L2 instruction, which represents the result of n-1. Since the instruction is addi a0, a0, -1, this operation subtracts 1 from the value of n
for K05 This is the value of the Stack Pointer (SP) after the second call to fib(n-2). After the first recursive call, n is reduced by 1, so the value at this position should represent the result of n-2
Initial n is 0x3, and the after the inst. L2: addi a0, a0, -1 the k05 becomes 0x3-1=0x2.
K04 = ra = 0x280
s0 is to store the temporary value to add fib(n - 1) + fib(n - 2) together equal to inst add a0, s0, t0.
- K06 = ?
0x298
There have 6 instructions between the first call fib and done label. Therefore the address is 0x280+6*4=0x298
The done
label is located after the first call to the fib function and the subsequent 6 instructions. Since each instruction occupies 4 bytes, the address of the done
label will be the initial address plus 6 * 4 = 24
bytes. If the initial address is 0x280, then the address of the done
label will be 0x280 + 0x18
, which is 0x298
.
- K07 = ?
0x2104
Since call would store the next instructionโs address so the first ra minus 4 is the answer, which is 0x2108.
Since the call
instruction stores the address of the next instruction in the ra register, this address is the address of the calling instruction plus 4. If the value of the ra register is 0x2108, then the address of the call fib
instruction will be 0x2108 - 4, which is 0x2104.