:information_source: General Information
- Use `ptr->member` instead of `(*ptr).member` when accessing a structure member through a pointer.
- Only the following headers may be used: `<stdint.h>`, `<stddef.h>`, `<stdlib.h>`, `<stdio.h>`, and `<string.h>`.

A
Given that we are developing an operating system kernel for RISC-V, implement mutex routines to handle locking and unlocking.
Complete the above implementation to ensure the mutex functions correctly using the RISC-V Atomic extension.
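The mutex listing itself is not reproduced here. As a point of reference, a minimal sketch of the idea using C11 atomics is shown below, with the corresponding RISC-V Atomic-extension instructions noted in comments. The names (`mutex_t`, `mutex_init`, `mutex_lock`, `mutex_unlock`) are illustrative assumptions, not necessarily the document's actual API.

```c
#include <stdatomic.h>

/* Illustrative mutex sketch. On RISC-V, the acquire exchange below maps to
 * amoswap.w.aq in a loop, and the release store to amoswap.w.rl (or a
 * fence followed by sw). */
typedef struct {
    atomic_int locked;   /* 0 = free, 1 = held */
} mutex_t;

static void mutex_init(mutex_t *m)
{
    atomic_store_explicit(&m->locked, 0, memory_order_relaxed);
}

static void mutex_lock(mutex_t *m)
{
    /* Spin until the old value was 0, i.e. we were the one to take it. */
    while (atomic_exchange_explicit(&m->locked, 1, memory_order_acquire) != 0)
        ;
}

static void mutex_unlock(mutex_t *m)
{
    atomic_store_explicit(&m->locked, 0, memory_order_release);
}
```

The acquire/release pairing matters: the lock must not let later memory accesses hoist above it, and the unlock must not let earlier accesses sink below it, which is exactly what the `.aq`/`.rl` annotations encode on RISC-V.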
Next, we will implement a condition variable based on the mutex implementation provided earlier.
Complete the above implementation so that the condition variable's function-like macros work correctly.
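Since the macro listing is not reproduced here, the following is one possible sketch of a spin-based condition variable built on a sequence counter, layered over a minimal spin mutex. All names (`cond_t`, `cond_signal`, `cond_wait`) are illustrative assumptions; a real kernel would sleep or yield instead of spinning.

```c
#include <stdatomic.h>

/* Minimal spin mutex (stand-in for the mutex from the previous part). */
typedef struct { atomic_flag f; } mutex_t;
#define MUTEX_INIT { ATOMIC_FLAG_INIT }
static void mutex_lock(mutex_t *m)
{ while (atomic_flag_test_and_set_explicit(&m->f, memory_order_acquire)) ; }
static void mutex_unlock(mutex_t *m)
{ atomic_flag_clear_explicit(&m->f, memory_order_release); }

/* Condition variable as a sequence counter: waiters record the current
 * sequence number, drop the mutex, and wait for the number to change. */
typedef struct { atomic_uint seq; } cond_t;
#define COND_INIT { 0 }

/* Wake all current waiters by bumping the sequence number. */
#define cond_signal(c) \
    atomic_fetch_add_explicit(&(c)->seq, 1, memory_order_release)

/* Release the mutex, wait for a signal, then re-acquire before returning.
 * As with any condition variable, callers must re-check the predicate. */
#define cond_wait(c, m)                                                      \
    do {                                                                     \
        unsigned _s = atomic_load_explicit(&(c)->seq, memory_order_acquire); \
        mutex_unlock(m);                                                     \
        while (atomic_load_explicit(&(c)->seq, memory_order_acquire) == _s)  \
            ;  /* spin; a real kernel would block the task here */           \
        mutex_lock(m);                                                       \
    } while (0)
```

Note the classic caveat: a signal delivered between the snapshot of `seq` and the unlock is not lost here, because the waiter compares against its snapshot rather than waiting for an event edge.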
Lastly, we aim to implement pthread_spin_lock using the RISC-V Atomic extension. Below is the C implementation:
And the RISC-V implementation:
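As neither listing is reproduced here, the sketch below shows the usual shape of such a spinlock in C11 atomics, with the RISC-V flavor noted in a comment. The names are illustrative, not pthreads' internal implementation.

```c
#include <stdatomic.h>

/* Illustrative test-and-test-and-set spinlock in the style of
 * pthread_spin_lock. RISC-V flavor (sketch):
 *   1: lw            t0, (a0)       # spin on a plain load first
 *      bnez          t0, 1b
 *      li            t0, 1
 *      amoswap.w.aq  t0, t0, (a0)   # attempt to take the lock
 *      bnez          t0, 1b         # someone beat us; retry
 */
typedef atomic_int spinlock_t;

static void spin_lock(spinlock_t *l)
{
    for (;;) {
        /* Cheap relaxed read first, so waiters do not hammer the
         * interconnect with atomic operations while the lock is held. */
        if (atomic_load_explicit(l, memory_order_relaxed) == 0 &&
            atomic_exchange_explicit(l, 1, memory_order_acquire) == 0)
            return;
    }
}

static void spin_unlock(spinlock_t *l)
{
    atomic_store_explicit(l, 0, memory_order_release);
}
```

The test-and-test-and-set pattern is a deliberate design choice: the AMO is only attempted once the plain load has observed the lock free, which reduces cache-line ping-ponging among spinning cores.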
B
Let's assume we are developing an operating system kernel, and now it's time to implement the context switching mechanism for RISC-V.
The generic code:
IRQ code:
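The listings above are not reproduced here. To illustrate what a context switch does, the sketch below uses the POSIX ucontext API as a user-space stand-in: a real RISC-V kernel would instead have a short assembly routine that saves and restores the callee-saved registers (`ra`, `sp`, `s0`–`s11`) into a per-task context struct. All names below are illustrative.

```c
#include <ucontext.h>

/* Save the current register state, load another task's saved state, and
 * later resume exactly where we left off. The ucontext calls play the role
 * of the kernel's swtch()-style assembly routine. */
static ucontext_t sched_ctx, task_ctx;
static char task_stack[16 * 1024];
static int task_ran = 0;

static void task(void)
{
    task_ran = 1;
    /* Yield back to the "scheduler", like swtch(&task_ctx, &sched_ctx). */
    swapcontext(&task_ctx, &sched_ctx);
}

static void run_once(void)
{
    getcontext(&task_ctx);
    task_ctx.uc_stack.ss_sp   = task_stack;
    task_ctx.uc_stack.ss_size = sizeof task_stack;
    task_ctx.uc_link          = &sched_ctx;  /* destination if task() returns */
    makecontext(&task_ctx, task, 0);

    /* Save the scheduler's state and run the task until it yields back. */
    swapcontext(&sched_ctx, &task_ctx);
}
```

The key property this models is that only callee-saved state plus the program counter and stack pointer need to be switched on a cooperative path; caller-saved registers are already dead at the call site by the ABI's rules.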
C
Consider a complex task divided into four segments: PartA, PartB, PartC, and PartD. Each segment is assigned to a core in a quad-core processor system, where each core has its own dedicated cache. Cache coherence is maintained using a snoopy-based, write-invalidate MSI (Modified, Shared, Invalid) protocol, as depicted. The objective is to analyze the performance of this system and explore potential optimizations for improvement.
Each core begins by accessing and updating its local copy of the population data, stored under a designated label (e.g., `pop_A` for the population of PartA). This process is illustrated in the pseudocode below:
The sequence of memory accesses for this process is as follows:
| (1) lw a1, pop_A | (2) lw a1, pop_B | (3) lw a1, pop_C | (4) lw a1, pop_D |
|---|---|---|---|
| (5) sw a1, pop_A | (6) sw a1, pop_B | (7) sw a1, pop_C | (8) sw a1, pop_D |
Our focus is to closely analyze the activities occurring on Core A. Since the memory accesses are independent (i.e., `pop_A`, `pop_B`, `pop_C`, and `pop_D` correspond to distinct memory locations), we can concentrate solely on Core A's memory interactions to understand its behavior. Please complete the table below by filling in columns C01 and C02. For each entry, describe the bus transaction(s) triggered by the access and the resulting state of the cache line for `pop_A` after the access.
Access | Shared bus transactions | Cache A |
---|---|---|
Initial state | - | pop_A: I |
lw a1, pop_A | BusRd | pop_A: C01 |
sw a1, pop_A | BusRdX | pop_A: C02 |
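As a way to reason about such tables, the toy model below encodes the standard write-invalidate MSI transitions from the point of view of the core performing the access; the bus transactions are noted in comments. This is an illustration of the protocol, not an answer key.

```c
/* Toy model of one cache line under write-invalidate MSI, as seen by the
 * accessing core. States: I(nvalid), S(hared), M(odified). */
typedef enum { MSI_I, MSI_S, MSI_M } msi_t;

/* Processor read: a miss in I issues BusRd and the line becomes S;
 * S and M satisfy the read locally with no bus transaction. */
static msi_t msi_read(msi_t st)
{
    return (st == MSI_I) ? MSI_S : st;
}

/* Processor write: from I or S the core must issue BusRdX to gain
 * exclusivity; either way the line ends in M. A write in M is silent. */
static msi_t msi_write(msi_t st)
{
    (void)st;
    return MSI_M;
}
```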
We are interested in exploring whether adopting the MESI protocol, as illustrated, could improve the performance of this processor. Assuming the four cores utilize a snoopy-based, write-invalidate MESI protocol, what would be the sequence of bus transactions and the cache line states for `pop_A` on Core A? Please complete the table below by filling in columns C03 and C04.
Access | Shared bus transactions | Cache A |
---|---|---|
Initial state | - | pop_A: I |
lw a1, pop_A | BusRd | pop_A: C03 |
sw a1, pop_A | - | pop_A: C04 |
Each core then updates the overall census count, stored at a shared location labeled `sum`. This process is illustrated in the following pseudocode:
The resulting sequence of memory accesses is as follows:
| (1) lw a2, sum | (3) lw a2, sum | (4) lw a2, sum | (6) lw a2, sum |
|---|---|---|---|
| (2) sw a2, sum | (7) sw a2, sum | (5) sw a2, sum | (8) sw a2, sum |
We observe a critical issue in the current census counting implementation: two processors may simultaneously read `sum`, each add their local population count to it independently, and then write back to `sum`, so one update overwrites the other and the total is wrong.
To ensure the final value of `sum` accurately reflects the combined `a0` values from all processors, semaphores should be integrated strategically to enforce mutual exclusion. These semaphores should minimize instruction blocking while preserving data correctness. Define and initialize any necessary semaphores, then insert `WAIT` and `SIGNAL` commands into the pseudocode below to ensure accurate updates to `sum`. It is not required to include `ecall` instructions for this modification.
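The intended pattern can be sketched in C, modeling `WAIT(sum_sem)`/`SIGNAL(sum_sem)` with a pthread mutex so the read-modify-write of `sum` becomes atomic with respect to the other cores. The labels L1–L7 of the pseudocode are not reproduced here; this only illustrates where the critical section lies, with hypothetical names.

```c
#include <pthread.h>
#include <stdint.h>

/* sum_sem initialized to 1 (binary semaphore), modeled as a mutex. */
static pthread_mutex_t sum_sem = PTHREAD_MUTEX_INITIALIZER;
static long sum = 0;

static void *add_population(void *arg)
{
    long local = (long)(intptr_t)arg;   /* this core's a0 value */

    pthread_mutex_lock(&sum_sem);       /* WAIT(sum_sem): before lw a2, sum  */
    sum += local;                       /* lw a2, sum; add a2, a2, a0; sw a2, sum */
    pthread_mutex_unlock(&sum_sem);     /* SIGNAL(sum_sem): after sw a2, sum */
    return NULL;
}
```

Keeping the critical section this tight is the "minimize instruction blocking" requirement: only the load, add, and store of `sum` sit between WAIT and SIGNAL, so the per-core `pop_X` updates remain fully parallel.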
Semaphore Initialization:
Pseudocode:
Specify the exact instruction, labeled `L1` to `L7`, after which the `WAIT(sum_sem)` command should be inserted to ensure exclusive access to `sum` during critical updates, thus maintaining data accuracy.
Identify the exact instruction, labeled `L1` to `L7`, after which the `SIGNAL(sum_sem)` command should be placed. This ensures exclusive access to `sum` during critical updates, maintaining data accuracy.
We are writing code to compute the greatest common divisor (GCD) using popcount.
Complete the above GCD implementation to make it work as expected.
Reference: Binary GCD algorithm
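The completed implementation is not reproduced here, but the idea can be sketched as follows: the binary GCD algorithm needs a count-trailing-zeros step, and ctz can be derived from popcount via `ctz(x) = popcount(~x & (x - 1))`. This is an illustrative implementation, not necessarily the document's.

```c
#include <stdint.h>

/* Classic bit-twiddling population count for a 32-bit word. */
static int popcount32(uint32_t x)
{
    x = x - ((x >> 1) & 0x55555555u);
    x = (x & 0x33333333u) + ((x >> 2) & 0x33333333u);
    x = (x + (x >> 4)) & 0x0F0F0F0Fu;
    return (int)((x * 0x01010101u) >> 24);
}

/* Count trailing zeros via popcount: ~x & (x - 1) has a 1 exactly in the
 * positions below the lowest set bit of x. Assumes x != 0. */
static int ctz32(uint32_t x)
{
    return popcount32(~x & (x - 1));
}

/* Binary (Stein's) GCD: strip common factors of two, then repeatedly
 * subtract the smaller odd value from the larger. */
static uint32_t gcd(uint32_t a, uint32_t b)
{
    if (a == 0) return b;
    if (b == 0) return a;
    int shift = ctz32(a | b);          /* common power-of-two factor */
    a >>= ctz32(a);
    do {
        b >>= ctz32(b);                /* make b odd */
        if (a > b) { uint32_t t = a; a = b; b = t; }
        b -= a;                        /* b is now even (or zero) */
    } while (b != 0);
    return a << shift;
}
```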
D
The diagram below illustrates a classic fully-bypassed 5-stage pipeline, enhanced with an unpipelined divider operating in parallel with the ALU. While bypass paths are not depicted, the iterative divider generates 2 bits per cycle until producing a complete 32-bit result.
In this pipeline design, asynchronous interrupts are handled in the MEM stage and trigger a jump to a dedicated interrupt trap handler address. The interrupt latency is defined as the number of cycles from when an interrupt request is raised in the MEM stage to when the first instruction of the interrupt handler reaches the MEM stage.
Consider the following code execution scenario. An interrupt is raised during cycle 8, causing a jump to `isr`. The label `isr` refers to an Interrupt Service Routine, which increments a counter at a fixed memory address before resuming the original execution context. The architectural guarantee of precise interrupts must be preserved. Assume all memory accesses take one cycle in the MEM stage.
Notably, the `div` instruction in RISC-V does not raise data-dependent exceptions. To prevent pipeline stalls during a multi-cycle divide operation, the pipeline control logic permits subsequent instructions that are independent of the divide result to proceed and complete before the divide finishes execution.
The RISC-V instructions `MRET`, `HRET`, `SRET`, and `URET` are used to return from traps taken into M-mode, H-mode, S-mode, and U-mode, respectively. When executing an `xRET` instruction, suppose `xPP` holds the value y:

- The privilege mode is changed to y, and the interrupt-enable bit of mode y (yIE) is updated with the value in `xPIE`.
- The `xPIE` field is set to 1.
- The `xPP` field is reset to U.

Trap handlers typically save the current system state:

And restore the system state upon executing the `MRET` instruction:
Reference: The RISC-V Instruction Set Manual: Volume II - Privileged Architecture
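The `mstatus` updates described above can be modeled in a few lines of C. This is a simplified illustration (the fields are plain struct members, not the actual `mstatus` bit positions), covering the trap entry and the `MRET` return for M-mode.

```c
/* Toy model of the mstatus fields affected by trap entry and MRET. */
enum { U_MODE = 0, S_MODE = 1, M_MODE = 3 };

typedef struct {
    int mie;    /* M-mode interrupt-enable bit           */
    int mpie;   /* previous interrupt enable (MPIE)      */
    int mpp;    /* previous privilege mode (MPP)         */
    int priv;   /* current privilege mode                */
} mstatus_t;

/* Trap into M-mode: stash the interrupt-enable bit and the privilege
 * mode, then disable interrupts while the handler runs. */
static void trap_to_m(mstatus_t *s)
{
    s->mpie = s->mie;
    s->mie  = 0;
    s->mpp  = s->priv;
    s->priv = M_MODE;
}

/* MRET: MIE <- MPIE, MPIE <- 1, drop to the mode saved in MPP,
 * and reset MPP to U, matching the xRET rules above with x = M. */
static void mret(mstatus_t *s)
{
    s->mie  = s->mpie;
    s->mpie = 1;
    s->priv = s->mpp;
    s->mpp  = U_MODE;
}
```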
What is the interrupt latency for the code above? __ D01 __ cycles
To ensure the program continues executing correctly, which instruction should the `isr` return to? Answer with one of the instructions shown above.
To reduce interrupt latency, we propose constraining the destination register for a `div` instruction. The ABI could be modified to reserve a specific `x` register to always hold the divide result, allowing the interrupt handler to avoid using it. In a more generalized exception handler, reading this register could be deferred until the end of the context save, potentially hiding the divide latency.
Does this approach require any hardware modifications? __ D03 __ (Answer `Yes` or `No`)
What is the corresponding limitation? __ D04 __ (Explain!)
E
Suppose you have been tasked with evaluating the performance of two processor designs: Processor1 and Processor2.
You plan to benchmark both processors using the following assembly sequence:
Processor1 is a 4-stage pipelined processor with full bypassing, designed to reduce the number of cycles required to execute each instruction. It combines the EXE and MEM stages, with load requests sent in the EXE stage and received in the WB stage one cycle later. The IF stage speculatively fetches the instruction at PC+4 and annuls incorrectly fetched instructions after a branch misprediction. Branches are resolved in the EXE stage.
Processor2 is a 6-stage pipelined processor with full bypassing, designed to improve performance by increasing the clock speed. The EXE stage is divided into two stages (EXE1 and EXE2) to reduce the critical path through the ALU. The IF stage always fetches the instruction at PC+4 and annuls incorrectly fetched instructions after a taken branch. ALU and Branch ALU results become available in the EXE2 stage.
How many cycles does it take Processor1 to execute one iteration of the loop? __ E01 __ cycle(s)
How many cycles does it take Processor2 to execute one iteration of the loop? __ E02 __ cycle(s)
What is the execution time of one loop iteration on Processor1? __ E03 __ ns per loop iteration