General Information
A
Coroutines are computer program components that generalize subroutines for non-preemptive multitasking, by allowing execution to be suspended and resumed. In order to implement general-purpose coroutines, a second call stack must be obtained, which is a feature not directly supported by the C language. A reliable (albeit platform-specific) way to achieve this is to use a small amount of inline assembly to explicitly manipulate the stack pointer during initial creation of the coroutine. Once a second call stack has been obtained, the setjmp and longjmp functions in the standard C library can then be used to implement the switches between coroutines.
These functions save and restore, respectively, the stack pointer, program counter, callee-saved registers, and any other internal state as required by the ABI, such that returning to a coroutine after having yielded restores all the state that would be restored upon returning from a function call. Minimalist implementations, which do not piggyback off the setjmp and longjmp functions, may achieve the same result via a small block of inline assembly which swaps merely the stack pointer and program counter, and clobbers all other registers. This can be significantly faster, as setjmp and longjmp must conservatively store all registers which may be in use according to the ABI, whereas the clobber method allows the compiler to store (by spilling to the stack) only what it knows is actually in use.
Let's implement a minimalist coroutine in RV32I, assuming no floating points in 32-bit.
Makefile
main.c
context.S
(incomplete)Please implement helper
with minimal offset to make it work on rv32emu. You should answer with register ABI names.
- A01 = ?
- A02 = ?
B
Assume that we are about to implement spinlock with RV32A.
The above is shown with GNU inline assembly.
Both B02 and B02 MUST contain %
. You should pick one of %0
and %1
. Both B01 and B03 are atomic instructions and containg .w.
in their mnemonic form.
- B01 = ?
- B02 = ?
- B03 = ?
- B04 = ?
C
Given the following RISC-V machine code (and instruction addresses), disassembly manually and fill in C01
and C02
for the following instructions.
See RISC-V instructions for encoding.
Both C01
and C02
MUST contain label names. That is, loop
and foo
.
- C01 = ?
- C02 = ?
D
We are going to write a function sumsqr
in RISC-V that, when given an integer , returns the summation below. If is not positive, then the function returns 0.
You are provided a RISC-V function named square
for this problem, which accepts an integer and returns its square. Square is used as a subroutine in sumsqr
to be implemented.
Fill the above D01
and D02
with valid RISC-V assembly.
- D01 = ?
- D02 = ?
E
Assume that our system certains memory addresses correspond to registers in I/O devices and not normal memory. That is, memory-mapped I/O.
0xFFFF0000
– Receiver Control: LSB is the ready bit, there may be other bits set that we do not need right now.0xFFFF0004
– Receiver Data: Received data stored at lowest byte.0xFFFF0008
– Transmitter Control: LSB is the ready bit, there may be other bits set that we do not need right now.0xFFFF000C
– Transmitter Data: Transmitted data stored at lowest byteWe write the following RISC-V code to read a byte from the receiver and immediately send it to the transmitter.
Fill E01
, E02
, and E03
to make it work as expected.
- E01 = ?
- E02 = ?
- E03 = ?
F
The benefits of multi-threading programming come only after you understand concurrency. Here are two most common concurrency issues:
To solve problem with Read-modify-write, we have to rely on the idea of undisrupted execution. Consider the following code using Amo.swap (single, undisrupted memory operation) to achieve the atomic primitive Test-and-set (TaS).
Fill the above F01
to make TaS work as expected.
- F01 = ?
why do we need special instructions for these operations? Why can't we use normal load and store? Explain!
- F02 = ?
G
The following shows a classic fully-bypassed 5-stage pipeline that has been augmented with an unpipelined divider in parallel with the ALU. Bypass paths are not shown in the diagram. This iterative divider produces 2 bits per cycle until it outputs a full 32-bit result.
In this design, asynchronous interrupts are handled in the MEM
stage and cause a jump to a dedicated interrupt trap handler address. The interrupt latency is defined as the number of cycles from when an interrupt request is raised in the MEM
stage until the first instruction of the interrupt handler reaches the MEM
stage.
Consider the execution of the code below. Suppose an interrupt is raised during cycle 8, which causes a jump to isr
. Here, isr
stands for interrupt Service Routine or interrupt handler. The handler increments a counter at a fixed memory address before returning to the original context. The architectural guarantee of precise interrupts should be upheld. Assume that all memory accesses take one cycle in the MEM
stage.
Note that the div
instruction in RISC-V can not raise a data-dependent exception. To avoid pipeline stalls while a multi-cycle divide operation is in progress, the pipeline control logic allows subsequent instructions that do not depend on the divide result to be issued and completed before the divide has completed.
The RISC-V MRET
, HRET
, SRET
, or URET
instructions are used to return from traps in M-mode, H-mode, S-mode, or U-mode respectively. When executing an xRET
instruction, supposing xPP
holds the value y, yIE
is set to xPIE
; the privilege mode is changed to y; xPIE
is set to 1; and xPP
is set to U.
Typically the trap handlers will save the current state
and restore the state on the MRET
instruction:
Reference: The RISC-V Instruction Set Manual Volume II: Privileged Architecture
What is the interrupt latency for the code above? (Answer in cycles)
- G01 = ?
Which instruction should isr
return to in order to ensure that the program will continue to execute correctly? Answer in one of the instructions shown above.
- G02 = ?
In order to reduce the interrupt latency, we would constrain the destination register for a div
instruction. The ABI could be changed to reserve a specific x register to always hold the divide result, so the interrupt handler can avoid using it. In a more generalized exception handler, reading the register can be deferred towards the end of the context save, which might hide the divide latency. Does this approach require any hardware modifications? G03 (Answer in Yes
or No
) What is the corresponding limitation? G04 (Explain!)
- G03 = ?
- G04 = ?
Let's think of the alternative approach to reduce the interrupt latency. The div
instruction could be redefined to write to a special-purpose register separate from the x register file. This is the approach adopted by MIPS with its hi
and lo
registers, but it involves an ISA change with RISC-V. A tempting proposition is to kill the ongoing divide operation and save the intermediate pipeline state so that it can be restarted after the interrupt. Would this approach create precise interrupt? Explain!
- G05 = ?
H
Consider a RISC-V processor that includes 230 bytes of virtual memory, 220 bytes of physical memory, and uses a page size of 27 bytes. With a 4-element, fully-associative Translation Lookaside Buffer (TLB) with an LRU replacement policy. A program running on the processor is halted right before executing the following instruction located at address 0x4BA
:
The first 8 entries of the page table are shown below. The page table uses an LRU replacement policy. Assume that all physical pages are currently in use.
VPN | V | R | D | PPN | |
---|---|---|---|---|---|
LRU | 0xD09 | 1 | 1 | 0 | 0x2 |
0x1 | 1 | 1 | 0 | 0xFA | |
0xCAF | 1 | 1 | 1 | 0xFF | |
0x3 | 1 | 1 | 0 | 0xB |
TLB
VPN | R | D | PPN | |
---|---|---|---|---|
Next LRU | 0 | 1 | 0 | 0x13 |
1 | 1 | 0 | 0xFA | |
2 | 0 | 0 | 0x3 | |
3 | 1 | 0 | 0xB | |
4 | 1 | 0 | 0x1 | |
5 | 0 | - | - | |
6 | 0 | - | - | |
LRU | 7 | 1 | 0 | 0xF |
… |
Page Table
For each virtual address in the chart below, please indicate the VPN, whether or not the access results in a TLB Hit, whether or not the access results in a page fault, the PPN, and the physical address. If there is not enough information given to determine a given value, please write None
. Please write all numerical values in hexadecimal.
Virtual Address | VPN | TLB Hit (Answer in Yes/No) | Page Fault (Answer in Yes/No) | PPN | Physical Address |
---|---|---|---|---|---|
0x4BA |
0x4 |
H01 | H02 | H03 | H04 |
0xCAFE4 |
H05 | H06 | H07 | H08 | H09 |
- H01 = ?
- H02 = ?
- H03 = ?
- H04 = ?
- H05 = ?
- H06 = ?
- H07 = ?
- H08 = ?
- H09 = ?
I
Below shows the solution to two process mutual exclusion problem. Answer True
or False
to the following questions:
Since variable turn
can take only two values then it imposes "strict alternations" between two processes.
- I01 = ?
This algorithm satisfies the "mutual exclusion" and "progress" conditions but not the "bounded waiting" condition.
- I02 = ?
This algorithm may cause "deadlock" if both processes set their flags to True at the same time.
- I03 = ?
This algorithm has a flaw as the variable turn
can be modified by both processes at the same time.
- I04 = ?
J
Consider the C code below:
We are about to convert to RISC-V assembly:
Fill in J01
(Line 104), J02
(Line 119), and J03
(Line 131) to make the above program work as expected.
- J01 = ?
- J02 = ?
- J03 = ?
K
Assume that we are working on a new microprocessor with new ISA. We are bored with the different RISC-V instruction types, and therefore we decide to include only one, universal type – the Z-type instruction.
Say we wish to include the following instructions:
add rd1, rs1, rs2
and rd1, rs1, rs2
lw rd1, offset1 (rs1)
sw rs2, offset1 (rs1)
addi rd1, rs1, imm1
beq rs1, rs2, offset1
lui rd1, offset1
jal rd1, imm
stw rs3, offset1, offset2 (rs1)
The new stw
instruction stores the contents of rs3 into both rs1 + offset1
and rs1 + offset2
. The RTL is:
Mem(R[rs1] + offset1) ← R[rs3] AND Mem(R[rs1] + offset2) ← R[rs3]
The funct3
and funct7
fields should be eliminated, and simply an opcode should be used instead. How many bits must the opcode field have if we simply want to support the above instructions? (Answer in bits)
- K01 = ?
With only one instruction, we want to be able to jump up to 64 KiB in either direction. How many bits are required to encode an immediate for us to be able to achieve this? Assume that, like RV32, the instruction's least significant bit is an implicit 0 and not saved. (Answer in bits)
- K02 = ?
Then, we ultimately choose the following instruction format. We have added a few fields to account for any future additions of additional instructions. The list index provided at the start of this problem's instructions is also the opcode for each instruction (sw
, for instance, has opcode 3
).
imm2 imm1 rs3 rs2 rs1 rd2 rd1 opcode
This instruction format is quite long, so we decide to work on a 64-bit machine. Each immediate field is 11
bits, and the opcode is 7
bits. What is the maximum number of registers we can have? (Answer in registers)
- K03 = ?