# Run FreeRTOS on VexRiscv and Measure/Tweak context switch overhead
###### tags: `Computer Architecture`
Github : [FreeRTOS-on-VexRiscv](https://github.com/chiangkd/FreeRTOS-on-VexRiscv)
## Requirement and Expectation
:::warning
1. Reproduce [Run FreeRTOS and multitasking on VexRiscv](https://hackmd.io/@oscarshiang/freertos_on_riscv) with [FreeRTOS 202212.00 released or latest](https://github.com/FreeRTOS/FreeRTOS)
2. Study [Reference Link](http://wiki.csie.ncku.edu.tw/embedded/arm-linux) to understand how to measure FreeRTOS context switch
3. Understand how to accurately measure cycle count and related latency
- Include timer interrupt.
4. Run the above on a simulator and quantify the context switch performance.
5. Study [Reducing Context Switching Overhead by Processor Architecture Modification](https://www.ripublication.com/aeee/20_pp%20155-162.pdf) and explain how an instruction-level mechanism can reduce the cost of a context switch.
:::
## Reproduce past work
- Step-by-step detail in [Note](https://hackmd.io/@chiangkd/Bkhdj1iOi)
### Rebuild and study the following two projects
- [Final Project: VexRiscv](https://hackmd.io/@oR8-QX4TQzGKDJ72DmqDUg/S1eKmCbkU#Final-Project-VexRiscv)
- [Final Project:Run FreeRTOS on VexRiscv and access the peripherals such as VGA.](https://hackmd.io/@4a740UnwQE6K9pc5tNlJpg/H1olFPOCD#Final-ProjectRun-FreeRTOS-on-VexRiscv-and-access-the-peripherals-such-as-VGA)
Use three terminals:

- Upper right: runs the Briey SoC
- Upper left: GDB, connected to the target
- Lower left: OpenOCD, connected to the Briey SoC

Take the **VGA** project in [VexRiscvSocSoftware](https://github.com/SpinalHDL/VexRiscvSocSoftware/tree/master/projects/briey/vga/src) as an example.
Run
```
riscv64-unknown-elf-gdb ../VexRiscv/VexRiscvSocSoftware/projects/briey/vga/build/vga.elf
```
and enter the following commands in GDB
```
(gdb)target remote localhost:3333
...
(gdb)monitor reset halt
...
(gdb)load
...
(gdb)continue
...
```

The RGB square then appears in the VGA GUI window.
There are other projects for Briey in [VexRiscvSocSoftware](https://github.com/SpinalHDL/VexRiscvSocSoftware):
- dhrystone
- timer
- uart
- vga
## Rebuild and Study [Run FreeRTOS and multitasking on VexRiscv](https://hackmd.io/@oscarshiang/freertos_on_riscv#Run-FreeRTOS)
- This project ports FreeRTOS to the Briey SoC.
- Installation and setup also follow [Final Project:Run FreeRTOS on VexRiscv and access the peripherals such as VGA.](https://hackmd.io/@4a740UnwQE6K9pc5tNlJpg/H1olFPOCD)
From the description:
- `/Source` contains the FreeRTOS source code
- `/Demo` contains a demo application for every official FreeRTOS port.
- `/Test` contains the tests performed on common code and the portable layer code.
### Use [RISC-V_RV32_QEMU_VIRT_GCC](https://github.com/FreeRTOS/FreeRTOS/tree/main/FreeRTOS/Demo/RISC-V_RV32_QEMU_VIRT_GCC) as a template and modify it
>The virt board is a platform which does not correspond to any real hardware; it is designed for use in virtual machines. It is the recommended board type if you simply want to run a guest such as Linux and do not care about reproducing the idiosyncrasies and limitations of a particular bit of real-world hardware.
### Memory layout
We can find the memory capacity definition in `Briey.scala`
```scala
axiCrossbar.addSlaves(
ram.io.axi -> (0x80000000L, onChipRamSize),
sdramCtrl.io.axi -> (0x40000000L, sdramLayout.capacity),
apbBridge.io.axi -> (0xF0000000L, 1 MB)
)
```
- There are two memory regions:
    - `ram` (on-chip RAM) with base address `0x80000000` and size `onChipRamSize` (4 kB)
    - `sdram` with base address `0x40000000` and size `sdramLayout.capacity`

That's why the linker script (`<VexRiscvSocSoftware>/projects/briey/libs/linker.ld`) in [VexRiscvSocSoftware](https://github.com/SpinalHDL/VexRiscvSocSoftware) declares
```
MEMORY {
onChipRam (W!RX)/*(RX)*/ : ORIGIN = 0x80000000, LENGTH = 4K
sdram (W!RX) : ORIGIN = 0x40000000, LENGTH = 64M
}
```
This shows that
- the init assembly and the stack are placed in on-chip RAM at `0x80000000` with `LENGTH = 4K`
- the text and data sections are placed in SDRAM at `0x40000000` with `LENGTH = 64M`

Rename `fake_rom.lds` to `linker.ld` and modify it:
```diff
MEMORY
{
/* Fake ROM area */
- rom (rxa) : ORIGIN = 0x80000000, LENGTH = 512K
- ram (wxa) : ORIGIN = 0x80080000, LENGTH = 512K
+ onChipRam (rxa) : ORIGIN = 0x80000000, LENGTH = 4K
+ sdram (wxa) : ORIGIN = 0x40000000, LENGTH = 64M
}
```
### Set up the UART library
Copy the header files from [VexRiscvSocSoftware/libs](https://github.com/SpinalHDL/VexRiscvSocSoftware/tree/master/libs) (`gpio.h`, `interrupt.h`, `prescaler.h`, `timer.h`, `uart.h`, `vga.h`) to `RISC-V_VexRiscv-Briey_GCC/Vex_libs` and merge [briey.h](https://github.com/SpinalHDL/VexRiscvSocSoftware/blob/master/projects/briey/libs/briey.h) (which defines several base-address macros) into `gpio.h`.
Rename `riscv-virt` to `riscv-briey` and modify the UART port.
```diff
#include <FreeRTOS.h>
#include <string.h>
#include "riscv-briey.h"
- #include "ns16550.h"
+ #include "Vex_libs/uart.h"
```
Then modify the associated function.
```diff
void vSendString( const char *s )
{
- struct device dev;
- size_t i;
+ size_t i, len = strlen(s);
- dev.addr = NS16550_ADDR;
portENTER_CRITICAL();
- for (i = 0; i < strlen(s); i++) {
- vOutNS16550( &dev, s[i] );
- }
- vOutNS16550( &dev, '\n' );
+ for (i = 0; i < len; i++) {
+ uart_write(UART, s[i]);
+ }
+ uart_write(UART, '\n');
portEXIT_CRITICAL();
}
```
### Implement Briey timer and integrate [uart](https://github.com/SpinalHDL/VexRiscvSocSoftware/blob/master/projects/briey/uart/src/main.c) and [timer](https://github.com/SpinalHDL/VexRiscvSocSoftware/blob/master/projects/briey/timer/src/main.c)
>Reference [Use Briey timer instead of machine timer](https://hackmd.io/qriUhWI2QEWeFbE_zKINrQ#Use-Briey-timer-instead-of-machine-timer)
`main.c`
```diff
int main( void )
{
int ret;
+ Uart_Config uartConfig;
+ uartConfig.dataLength = 8;
+ uartConfig.parity = NONE;
+ uartConfig.stop = ONE;
+ uartConfig.clockDivider = 50000000/8/115200-1;
+ uart_applyConfig(UART,&uartConfig);
- // trap handler initialization
- #if( mainVECTOR_MODE_DIRECT == 1 )
- {
- __asm__ volatile( "csrw mtvec, %0" :: "r"( freertos_risc_v_trap_handler ) );
- }
- #else
- {
- __asm__ volatile( "csrw mtvec, %0" :: "r"( ( uintptr_t )freertos_vector_table | 0x1 ) );
- }
- #endif
#if defined(DEMO_BLINKY)
ret = main_blinky();
#else
#error "Please add or select demo."
#endif
return ret;
}
```
- Reference [uart](https://github.com/SpinalHDL/VexRiscvSocSoftware/blob/master/projects/briey/uart/src/main.c) project.
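
The `clockDivider` expression can be checked with a little integer arithmetic. The sketch below is a host-side model, not part of the demo. It assumes that the UART is clocked at 50 MHz and that the factor 8 is the number of samples per bit, both read off the expression `50000000/8/115200-1`; the function names are hypothetical.

```c
#include <stdint.h>

/* Divider in the form used by main.c: clk / samples_per_bit / baud - 1.
 * Integer division truncates, exactly as in the C expression of the demo. */
static uint32_t uart_clock_divider(uint32_t clk_hz, uint32_t samples_per_bit,
                                   uint32_t baud)
{
    return clk_hz / samples_per_bit / baud - 1u;
}

/* Baud rate that a given divider produces under the same assumed model. */
static uint32_t uart_actual_baud(uint32_t clk_hz, uint32_t samples_per_bit,
                                 uint32_t divider)
{
    return clk_hz / (samples_per_bit * (divider + 1u));
}
```

Under these assumptions `uart_clock_divider(50000000, 8, 115200)` gives 53, and `uart_actual_baud(50000000, 8, 53)` gives 115740, about 0.5% above the nominal 115200.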
Implement `vPortSetupTimerInterrupt()`, which is declared **weak** in `Source/portable/GCC/RISC-V/port.c`.
- The interrupt frequency must equal `configTICK_RATE_HZ` (defined in `FreeRTOSConfig.h`).

Implement `handle_trap()`, which handles timer interrupts. It has to clear the timer's PENDING bit and increment the system tick count.
`riscv-briey.c`
```c
void handle_trap(void)
{
portENTER_CRITICAL();
xTaskIncrementTick();
TIMER_INTERRUPT->PENDINGS = 1;
portEXIT_CRITICAL();
}
void vPortSetupTimerInterrupt(void)
{
asm volatile("li t0, 0x1808\n\t"
"csrw mstatus, t0\n\t"
"li t0, 0x880\n\t"
"csrw mie, t0\n\t");
interruptCtrl_init(TIMER_INTERRUPT);
prescaler_init(TIMER_PRESCALER);
timer_init(TIMER_A);
TIMER_PRESCALER->LIMIT = 500;
TIMER_A->LIMIT = 1000;
TIMER_A->CLEARS_TICKS = 0x00010002;
TIMER_INTERRUPT->PENDINGS = 0xF;
TIMER_INTERRUPT->MASKS = 0x1;
}
```
- Note that `handle_trap()` is hooked in through `-DportasmHANDLE_INTERRUPT=handle_trap` in the Makefile. This is an **assembler** macro, not a compiler macro.

**mstatus**

- `0x1808` sets **MPP\[1:0\] = 11** (bits 12:11), which stands for machine mode, and also sets **MIE = 1** (bit 3, global machine interrupt enable).

**mie**

- `0x880` sets **MEIE** (bit 11) and **MTIE** (bit 7) to 1, which enables **machine-level external** and **machine timer** interrupts.
- Reference: [Privileged spec](https://riscv.org/technical/specifications/)
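
To make the two magic numbers easier to verify, here is a small host-side sketch that rebuilds them from named bit fields. The bit positions follow the RISC-V privileged specification; the macro and function names are my own and do not come from the FreeRTOS port.

```c
#include <stdint.h>

/* mstatus fields */
#define MSTATUS_MIE       (1u << 3)    /* global machine interrupt enable */
#define MSTATUS_MPP_SHIFT 11u
#define MSTATUS_MPP_MASK  (3u << MSTATUS_MPP_SHIFT) /* previous privilege mode */

/* mie fields */
#define MIE_MTIE          (1u << 7)    /* machine timer interrupt enable */
#define MIE_MEIE          (1u << 11)   /* machine external interrupt enable */

/* Extract MPP from an mstatus value: 3 (binary 11) means machine mode. */
static uint32_t mstatus_mpp(uint32_t mstatus)
{
    return (mstatus & MSTATUS_MPP_MASK) >> MSTATUS_MPP_SHIFT;
}
```

`MSTATUS_MPP_MASK | MSTATUS_MIE` evaluates to `0x1808` and `MIE_MEIE | MIE_MTIE` to `0x880`, matching the constants loaded in `vPortSetupTimerInterrupt()`.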
### Incompatible problem
```
ABI is incompatible with that of the selected emulation:
target emulation `elf32-littleriscv' does not match `elf64-littleriscv'
```
- I found a similar [issue](https://github.com/FreeRTOS/FreeRTOS/issues/830), followed the discussion, and resolved it.
- This problem is already fixed in [\#838](https://github.com/FreeRTOS/FreeRTOS/commit/cee9d5c560eb38664f20b9424a6d5b930b18f803)
### Memory not fit problem
```
/opt/riscv/lib/gcc/riscv64-unknown-elf/12.2.0/../../../../riscv64-unknown-elf/bin/ld: build/RTOSDemo.axf section `.text' will not fit in region `onChipRam'
/opt/riscv/lib/gcc/riscv64-unknown-elf/12.2.0/../../../../riscv64-unknown-elf/bin/ld: region `onChipRam' overflowed by 46800 bytes
```
- Most of the time, the VMA (the address a section runs at) and the LMA (the address it is loaded at) are the same, but in embedded systems they can differ.
- `onChipRam` (4K) does not have enough capacity to hold the code (`.text` section).
- Place the `.text` and `.data` sections in `sdram` (64M) instead.
### Relocation truncated
```
build/start.o: in function `.L0 ':
/home/aaron/FreeRTOS-Briey-cs-latency/FreeRTOS/Demo/RISC-V-VexRiscv-Briey_GCC/start.S:68:(.init+0x62): relocation truncated to fit: R_RISCV_JAL against symbol `main' defined in .text.main section in build/main.o
```
- [Similar problem](https://www.technovelty.org/c/relocation-truncated-to-fit-wtf.html)
```diff
- jal main
+ call main
```
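Replacing `jal` with `call` solves this problem: `jal` can only reach targets within about ±1 MiB of the current `pc`, while `call` expands to an `auipc` + `jalr` pair that can reach from the on-chip RAM at `0x80000000` to `main` in SDRAM at `0x40000000`.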
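
The range limit behind the `R_RISCV_JAL` error can be checked on the host. This is a minimal sketch with a hypothetical helper name. The caller address `0x80000062` is taken from the linker message (`.init+0x62`), while the target `0x40000000` is only an illustrative address at the start of SDRAM, not the real address of `main`.

```c
#include <stdbool.h>
#include <stdint.h>

/* JAL encodes a signed 21-bit, 2-byte-aligned offset,
 * so the reachable range is [-2^20, 2^20 - 2] bytes from pc. */
static bool fits_jal(uint32_t pc, uint32_t target)
{
    int64_t offset = (int64_t)target - (int64_t)pc;
    return offset >= -(INT64_C(1) << 20)
        && offset <= (INT64_C(1) << 20) - 2
        && offset % 2 == 0;
}
```

`fits_jal(0x80000062, 0x40000000)` is false because the distance is roughly 1 GiB, whereas a jump inside SDRAM such as `fits_jal(0x40000254, 0x40003162)` is true.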
### Debugging
Copy `<FreeRTOS repo>/FreeRTOS/Demo/Common` into `<FreeRTOS-Briey>/Demo`, since it is not included in the [repo](https://github.com/OscarShiang/FreeRTOS-Briey).
:::info
**"Full" vs "Minimal" demo application files**
- Files in the `FreeRTOS/Demo/Common/Full` directory assume a hosted environment and are only used by demos that run on top of old DOS systems (which is also why the `Partest.c` filename is cryptic: it could only use short filenames in 8.3 format).
- Files in the `FreeRTOS/Demo/Common/Minimal` directory do not assume a hosted environment.
:::
Under `/FreeRTOS`
```
# open GDB console
$ cd <demo_directory>/RISC-V_VexRiscv-Briey_GCC
$ make DEBUG=1 # for debug symbol
$ riscv64-unknown-elf-gdb build/RTOSDemo.axf
```
### Compressed instruction stuck problem
I ran into a problem where **GDB gets stuck** and never reaches the `main` function.

Note that the project is compiled with `-march=rv32imac`. With compressed instructions, each `nop` is encoded in 2 bytes (the previous work used 7 `nop` instructions to push the entry to `0x80000020`). I suspect there is a second cause that I have not figured out yet: single-stepping (`s`) in GDB seems to advance `pc` by 4 at each step.
The `li` instruction is also compressed to 2 bytes, and under `riscv64-unknown-elf-gdb` it does not load `0` into `a1` properly.
```
80000028: f1402573 csrr a0,mhartid
8000002c: 4581 li a1,0
8000002e: 06b51363 bne a0,a1,80000094 <secondary>
```
When stepping through this code in GDB, `a1` is not set to `0`, which causes the following `bne` to jump to the label `secondary`. That path handles the other harts of a multicore system and should not be taken in my case.
So I decided to build the project without compressed instructions.
Check the available multilibs:
```
riscv64-unknown-elf-gcc -print-multi-lib
```
```
.;
rv32i/ilp32;@march=rv32i@mabi=ilp32
rv32im/ilp32;@march=rv32im@mabi=ilp32
rv32iac/ilp32;@march=rv32iac@mabi=ilp32
rv32imac/ilp32;@march=rv32imac@mabi=ilp32
rv32imafc/ilp32f;@march=rv32imafc@mabi=ilp32f
rv64imac/lp64;@march=rv64imac@mabi=lp64
```
```diff
-CFLAGS = -march=rv32imac -mabi=ilp32 -mcmodel=medany \
+CFLAGS = -march=rv32im -mabi=ilp32 -mcmodel=medany \
-Wall \
-fmessage-length=0 \
-ffunction-sections \
-fdata-sections \
-fno-builtin-printf
LDFLAGS = -nostartfiles -Tlinker.ld \
- -march=rv32imac -mabi=ilp32 -mcmodel=medany \
+ -march=rv32im -mabi=ilp32 -mcmodel=medany \
-Xlinker --gc-sections \
-Xlinker --defsym=__stack_size=300 \
-Xlinker -Map=RTOSDemo.map
```
- GDB now runs properly.
:::warning
However, this workaround simply gives up on compressed instructions.
:::
### Configure Briey SoC to support compressed instructions (RVC)
There are some discussions about errors and incompatible features, as well as some projects with RVC-compatible implementations.
- [RV32IMC with IBusCachedPlugin #93](https://github.com/SpinalHDL/VexRiscv/issues/93)
- [Briey Soc: Framebuffer display flickered #198](https://github.com/SpinalHDL/VexRiscv/issues/198)
**Some similar project**
- [ECP5_Brieysoc](https://github.com/jmio/ECP5_Brieysoc)
- [VexRiscv-verilog](https://github.com/xobs/VexRiscv-verilog)
Following the configurations above, modify `IBusCachedPlugin`:
```diff
new IBusCachedPlugin(
resetVector = 0x80000000l,
prediction = STATIC,
+ compressedGen = true, // for compressed instruction
config = InstructionCacheConfig(
cacheSize = 4096,
bytePerLine =32,
wayCount = 1,
addressWidth = 32,
cpuDataWidth = 32,
memDataWidth = 32,
catchIllegalAccess = true,
catchAccessFault = true,
asyncTagMemory = false,
- twoCycleRam = true,
- twoCycleCache = true
+ twoCycleRam = false, // true -> false, for compressed instruction
+ twoCycleCache = false // true -> false, for compressed instruction
)
```
- Rebuild the Briey SoC. It then runs properly with compressed instructions, without modifying the `Makefile`.

Furthermore, remember that executing `ecall` raises an environment-call exception and jumps to the address in `mtvec`, which is `0x80000020`. That's why [Oscar](https://hackmd.io/qriUhWI2QEWeFbE_zKINrQ#Replace-vPortYield-with-ecall) added several `nop` instructions to push `main_entry` to `0x80000020`.
With RVC, each `nop` is compressed to 2 bytes, so about twice as many `nop` instructions are needed to do the same thing.
```diff
_init:
j _start
nop
nop
nop
nop
nop
nop
nop
+ nop
+ nop
+ nop
+ nop
+ nop
+ nop
+ nop
+ nop
.globl main_entry
main_entry:
la a0, freertos_risc_v_trap_handler
jr a0
```
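
The number of padding instructions follows from simple arithmetic. The sketch below uses a hypothetical helper and assumes that, with RVC enabled, the leading `j _start` is also emitted as a 2-byte instruction, which is what makes 15 `nop`s land `main_entry` exactly on `0x80000020`.

```c
#include <stdint.h>

/* Number of nops needed so that the code following one jump instruction
 * starts at `target_offset` bytes from the beginning of the section. */
static uint32_t nops_needed(uint32_t target_offset, uint32_t jump_bytes,
                            uint32_t nop_bytes)
{
    return (target_offset - jump_bytes) / nop_bytes;
}
```

`nops_needed(0x20, 4, 4)` gives the original 7 `nop`s, and `nops_needed(0x20, 2, 2)` gives 15, which is the 7 existing plus the 8 added in the diff above.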
### Linker script `lma` problem
`portASM.S` contains a handler that catches exceptions and loads their description into registers for inspection in the debugger.
```c
freertos_risc_v_application_exception_handler:
csrr t0, mcause /* For viewing in the debugger only. */
csrr t1, mepc /* For viewing in the debugger only */
csrr t2, mstatus /* For viewing in the debugger only */
j .
```
Run `(gdb) info registers` to check the `mcause` value:

- `mcause` = 0x5, which stands for load access fault
- `mepc` = 0x80000074, which points to `80000074: 00052283 lw t0,0(a0)`
- `mstatus` = 0x1800

Check the disassembly:
```
80000060: a4c50513 addi a0,a0,-1460 # 80000aa8 <_bss_lma>
80000064: c000c597 auipc a1,0xc000c
80000068: 6a458593 addi a1,a1,1700 # 4000c708 <impure_data>
8000006c: 82418613 addi a2,gp,-2012 # 4000d0cc <_bss>
80000070: 00c5fc63 bgeu a1,a2,80000088 <_start+0x48>
80000074: 00052283 lw t0,0(a0)
```
The assembly source says `la a0, _data_lma`, but in the disassembly the address resolves to `_bss_lma`.
Make the LMA of the data section the same as its VMA:
```diff
{
_data_lma = LOADADDR(.data.start);
- } >sdram AT>onChipRam
+ } >sdram AT>sdram
```
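
The faulting `lw t0,0(a0)` sits in the startup loop that copies `.data` from its load address (LMA) to its run address (VMA). As an illustration only, the following host-side C model shows what such a loop does, assuming the usual word-by-word copy. The real code is assembly in `start.S`, and the function and parameter names here are hypothetical.

```c
#include <stdint.h>

/* Copy an initialised data image from its load address to its run address.
 * `lma` corresponds to a0, `vma_start` to a1 and `vma_end` to a2 in the
 * disassembly above. The loop is skipped when vma_start >= vma_end,
 * which mirrors the `bgeu a1,a2` check. */
static void copy_data_section(const uint32_t *lma, uint32_t *vma_start,
                              uint32_t *vma_end)
{
    while (vma_start < vma_end) {
        *vma_start++ = *lma++;   /* the load that corresponds to `lw t0,0(a0)` */
    }
}
```

When the LMA equals the VMA, as after the linker script change, this copy no longer needs to read from on-chip RAM.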
Now it successfully enters the `main` function, but it traps into `freertos_risc_v_application_exception_handler` again.

```
40000254: 70f020ef jal ra,40003162 <strlen>
```

- The problem is described [here](https://sourceware.org/bugzilla/show_bug.cgi?id=27999).
```
riscv64-unknown-elf-gdb --version
GNU gdb (GDB) 10.1
Copyright (C) 2020 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
```
- Update gdb by reinstalling [riscv-gnu-toolchain](https://github.com/riscv-collab/riscv-gnu-toolchain)
```
GNU gdb (GDB) 12.1
Copyright (C) 2022 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
```

- Now the string is printed successfully.

The bug was caused by setting `-march=rv32ima` in `CFLAGS` and `-march=rv32imac` in `LDFLAGS`, both with `-mabi=ilp32`.
- `CFLAGS` is used when compiling the code
- `LDFLAGS` is used when linking with the linker script

It was fixed once I used the modified Briey SoC (with compressed instructions).
### Illegal instruction problem

- `t0` (which holds `mcause`) = 2 stands for illegal instruction; see the [previous work](https://hackmd.io/@oscarshiang/freertos_on_riscv#Illegal-instruction).

Change the setting in `Briey.scala`:
```diff
new CsrPlugin(
config = CsrPluginConfig(
catchIllegalAccess = false,
mvendorid = null,
marchid = null,
mimpid = null,
mhartid = null,
misaExtensionsInit = 66,
misaAccess = CsrAccess.NONE,
mtvecAccess = CsrAccess.NONE,
mtvecInit = 0x80000020l,
mepcAccess = CsrAccess.READ_WRITE,
mscratchGen = false,
mcauseAccess = CsrAccess.READ_ONLY,
mbadaddrAccess = CsrAccess.READ_ONLY,
mcycleAccess = CsrAccess.NONE,
minstretAccess = CsrAccess.NONE,
- ecallGen = false,
+ ecallGen = true,
wfiGenAsWait = false,
ucycleAccess = CsrAccess.NONE,
uinstretAccess = CsrAccess.NONE
)
```
And it successfully enters the task.

### Machine timer interrupt
However, it still traps into the exception handler.

- This time the cause is a **machine timer interrupt**.

If I use the `GCC/RISC-V/portASM.S` from **FreeRTOS Kernel V10.4.6** instead, the context switch runs successfully.
In the previous version, `portASM.S` used `portasmHANDLE_INTERRUPT` for handling external interrupts. In the latest version (**FreeRTOS Kernel V10.5.1**), the code was refactored and `portasmHANDLE_INTERRUPT` disappeared.
- Details in [#444](https://github.com/FreeRTOS/FreeRTOS-Kernel/commit/9efca75d1ebfc6c02f9e004c199dffa327267a09#diff-391eed95c9ce193746aac395a24167a2fec2fb1978a599d3455e9d3e94e2ac14)

[Using FreeRTOS on RISC-V Microcontrollers](https://www.freertos.org/Using-FreeRTOS-on-RISC-V.html) mentions that building FreeRTOS for a RISC-V core requires the following:
1. Include the core FreeRTOS source files and the FreeRTOS RISC-V port layer source files in your project.
2. Ensure the assembler's include path includes the path to the header file that describes any chip specific implementation details.
3. Define either a constant in FreeRTOSConfig.h or a linker variable to specify the memory to use as the interrupt stack.
4. Define configMTIME_BASE_ADDRESS and configMTIMECMP_BASE_ADDRESS in FreeRTOSConfig.h.
5. For the assembler, #define portasmHANDLE_INTERRUPT to the name of the function provided by your chip or tools vendor for handling external interrupts.
6. Install the FreeRTOS trap handler.
So I changed the file to use external interrupts.
`portASM.S`
```diff
asynchronous_interrupt:
store_x a1, 0( sp ) /* Asynchronous interrupt so save unmodified exception return address. */
load_x sp, xISRStackTop /* Switch to ISR stack. */
- j handle_interrupt
+ j portasmHANDLE_INTERRUPT /* Handle external interrupts */
...
handle_exception:
/* a0 contains mcause. */
li t0, 11 /* 11 == environment call. */
- bne a0, t0, application_exception_handler /* Not an M environment call, so some other exception. */
+ bne a0, t0, portasmHANDLE_INTERRUPT /* Handle external interrupts */
call vTaskSwitchContext
j processed_source
```
```
Hello FreeRTOS!
0: Tx: Transfer1
0: Tx: Transfer2
0: Rx: Blink1
0: Tx: Transfer1
0: Rx: Blink2
0: Tx: Transfer2
```
For some unknown reason, the result is not what I expected. It should look like this:
```
Hello FreeRTOS!
0: Tx: Transfer1
0: Rx: Blink1
0: Tx: Transfer2
0: Rx: Blink2
0: Tx: Transfer1
0: Rx: Blink1
0: Tx: Transfer2
0: Rx: Blink2
```
as shown in the [previous work](https://hackmd.io/@oscarshiang/freertos_on_riscv).
Note that in the previous work `configUSE_PREEMPTION` is set to 0, which means cooperative scheduling. So I added `portYIELD();` at the end of the `sender` and `receiver` tasks.
With that change, the context switch runs as expected.
## Cycle count and Latency in VexRiscv
`VexRiscv.scala` defines a 5-stage pipeline (Fetch, Decode, Execute, Memory, WriteBack):
```scala
def newStage(): Stage = { val s = new Stage; stages += s; s }
val decode = newStage()
val execute = newStage()
val memory = ifGen(config.withMemoryStage) (newStage())
val writeBack = ifGen(config.withWriteBackStage) (newStage())
```
The fetch stage can be built with or without RVC support:
```scala
def withRvc = plugins.find(_.isInstanceOf[IBusFetcher]) match {
case Some(x) => x.asInstanceOf[IBusFetcher].withRvc
case None => false
}
```
which relies on the `IBusFetcher` trait declared in `Services.scala`:
```scala
trait IBusFetcher{
def haltIt() : Unit
def incoming() : Bool
def pcValid(stage : Stage) : Bool
def getInjectionPort() : Stream[Bits]
def withRvc() : Boolean
def forceNoDecode() : Unit
}
```
There are several regression tests that report clock cycle counts.
`src/test/cpp/regression/main.cpp` shows how the regression tests work.
A ready-valid handshake protocol is implemented:
```cpp
for(SimElement* simElement : simElements) simElement->preCycle();
dump(i + 1);
checks();
//top->eval();
top->clk = 1;
top->eval();
instanceCycles += 1;
for(SimElement* simElement : simElements) simElement->postCycle();
```
Notice that `instanceCycles` is incremented between two `for` loops that call `preCycle()` and `postCycle()`.
The definitions of these two functions depend on the IBUS configuration. Take `IBUS_SIMPLE` as an example:
```cpp
virtual void preCycle(){
if (top->iBus_cmd_valid && top->iBus_cmd_ready) {
//assertEq(top->iBus_cmd_payload_pc & 3,0);
pendings[wPtr] = (top->iBus_cmd_payload_pc);
wPtr = (wPtr + 1) & 0xFF;
//ws->iBusAccess(top->iBus_cmd_payload_pc,&inst_next,&error_next);
}
}
//TODO doesn't catch when instruction removed ?
virtual void postCycle(){
top->iBus_rsp_valid = 0;
if(rPtr != wPtr && (!ws->iStall || VL_RANDOM_I_WIDTH(7) < 100)){
uint32_t inst_next;
bool error_next;
ws->iBusAccess(pendings[rPtr], &inst_next,&error_next);
rPtr = (rPtr + 1) & 0xFF;
top->iBus_rsp_payload_inst = inst_next;
top->iBus_rsp_valid = 1;
top->iBus_rsp_payload_error = error_next;
} else {
top->iBus_rsp_payload_inst = VL_RANDOM_I_WIDTH(32);
top->iBus_rsp_payload_error = VL_RANDOM_I_WIDTH(1);
}
if(ws->iStall) top->iBus_cmd_ready = VL_RANDOM_I_WIDTH(7) < 100;
}
```
Each cycle calls every `simElement->preCycle()`, which checks whether `iBus_cmd_valid` and `iBus_cmd_ready` are both high, to make sure the data is transferred correctly.
For example:
- **Slow writer(sender) and fast reader(receiver)**

- **Fast writer(sender) and slow reader(receiver)**

Under `/regression`, use the command
```
make clean run IBUS=SIMPLE DBUS=SIMPLE CSR=no MMU=no DEBUG_PLUGIN=no MUL=no DIV=no TRACE=yes
```
With `TRACE=yes`, the run generates `.fst` files that can be inspected in GTKWave.
Take `rv32ui-p-xor.fst`, one of the regression tests, as an example.

- When `valid` and `ready` are both 1, `pc` increases by 4 at the next rising clock edge, which confirms that the PC advances in step with the instruction transfers.
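
The handshake rule can be condensed into a few lines. This is a simplified host-side C model, not the Verilator test bench: the struct and function names are hypothetical, and it assumes sequential, uncompressed fetches so that every accepted request advances `pc` by 4.

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint32_t pc;         /* address of the next fetch request */
    uint32_t transfers;  /* number of accepted requests so far */
} fetch_model_t;

/* Model one rising clock edge. A transfer "fires" only when the sender
 * asserts valid and the receiver asserts ready in the same cycle. */
static bool fetch_step(fetch_model_t *m, bool valid, bool ready)
{
    bool fire = valid && ready;
    if (fire) {
        m->pc += 4u;
        m->transfers += 1u;
    }
    return fire;
}
```

A slow receiver (`ready` low while `valid` is high) or a slow sender (`valid` low while `ready` is high) leaves `pc` unchanged for that cycle, which is the stalling behaviour the waveform illustrates.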
And the `checks()` function:
```cpp
virtual void checks(){
if(top->VexRiscv->lastStageRegFileWrite_valid == 1 && top->VexRiscv->lastStageRegFileWrite_payload_address != 0){
assertEq(top->VexRiscv->lastStageRegFileWrite_payload_address, regFileWriteRefArray[regFileWriteRefIndex][0]);
assertEq(top->VexRiscv->lastStageRegFileWrite_payload_data, regFileWriteRefArray[regFileWriteRefIndex][1]);
//printf("%d\n",i);
regFileWriteRefIndex++;
if(regFileWriteRefIndex == sizeof(regFileWriteRefArray)/sizeof(regFileWriteRefArray[0])){
pass();
}
}
}
```
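- It first checks that the register file write is valid (valid bit = 1) and that the associated payload address is not 0.
- It then calls `assertEq`, which throws an exception if its two inputs are not equal:
    - `payload_address` against `regFileWriteRefArray[regFileWriteRefIndex][0]`
    - `payload_data` against `regFileWriteRefArray[regFileWriteRefIndex][1]`

- This can also be observed in GTKWave.

After `preCycle()` and `checks()` have run and the clock edge has been evaluated, `instanceCycles += 1;` is executed.
Judging from the `IBUS_SIMPLE` code above, `postCycle()` answers the pending fetch requests by driving the `iBus_rsp_*` signals. I have not yet figured out whether it is also related to GDB.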
### Timer interrupt
```cpp
#ifdef TIMER_INTERRUPT
top->timerInterrupt = mTime >= mTimeCmp ? 1 : 0;
//if(mTime == mTimeCmp) printf("SIM timer tick\n");
#endif
```
`mTime` increments, and once `mTime` is greater than or equal to `mTimeCmp`, `timerInterrupt` is asserted.
- I have not yet figured out how the timer interrupt affects the cycle count.
## Context Switch performance
Based on the [paper](https://ieeexplore.ieee.org/abstract/document/467697) and the [reference link](http://wiki.csie.ncku.edu.tw/embedded/arm-linux#context-switch-latency-%E7%90%86%E8%AB%96%E8%88%87%E5%AF%A6%E9%9A%9B%E7%9A%84%E7%B5%90%E5%90%88), construct a test method:
| Symbol | Paper | Briey SoC |
| --- | ------------- | --------------------------------- |
| C | Cache size | 4096 byte |
| N | Array size | (process size) x (process number) |
| s | stride | (process size) |
| b | line size | 32 byte |
| a | associativity | 1 way |
The associated definitions can be found in `Briey.scala`:
```scala
cpuPlugins = ArrayBuffer(
new PcManagerSimplePlugin(0x80000000l, false),
// new IBusSimplePlugin(
// interfaceKeepData = false,
// catchAccessFault = true
// ),
new IBusCachedPlugin(
resetVector = 0x80000000l,
prediction = STATIC,
compressedGen = true, // for compressed instruction
config = InstructionCacheConfig(
cacheSize = 4096,
bytePerLine =32,
wayCount = 1,
addressWidth = 32,
cpuDataWidth = 32,
memDataWidth = 32,
catchIllegalAccess = true,
catchAccessFault = true,
asyncTagMemory = false,
twoCycleRam = false, // true -> false, for compressed instruction
twoCycleCache = false // true -> false, for compressed instruction
)
// askMemoryTranslation = true,
// memoryTranslatorPortConfig = MemoryTranslatorPortConfig(
// portTlbSize = 4
// )
),
// new DBusSimplePlugin(
// catchAddressMisaligned = true,
// catchAccessFault = true
// ),
new DBusCachedPlugin(
config = new DataCacheConfig(
cacheSize = 4096,
bytePerLine = 32,
wayCount = 1,
addressWidth = 32,
cpuDataWidth = 32,
memDataWidth = 32,
catchAccessError = true,
catchIllegal = true,
catchUnaligned = true
),
memoryTranslatorPortConfig = null
// memoryTranslatorPortConfig = MemoryTranslatorPortConfig(
// portTlbSize = 6
// )
),
```
The paper also describes several regimes with different size settings, to measure the time per iteration with and without cache misses:
| Regime | Size of Array | Stride | Frequency of Misses | Time per Iteration |
| ------ | ----------------- | --------------------- | ----------------------------- | -------------------- |
| 1 | $1 \leq N \leq C$ | $1 \leq s \leq N/2$ | no misses | $T_{no-miss}$ |
| 2.a | $C \lt N$ | $1 \leq s \lt b$ | one miss every $b/s$ elements | $T_{no-miss} + Ds/b$ |
| 2.b | $C \lt N$ | $b \leq s \lt N/a$ | one miss every element | $T_{no-miss} + D$ |
| 2.c | $C \lt N$ | $N/a \leq s \leq N/2$ | no misses | $T_{no-miss}$ |
- No misses are generated when $N \leq C$. When $N \gt C$, the miss rate is determined by the stride between consecutive elements. $D$ is the delay penalty of a miss.
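
The regimes can be turned into a small classifier, which makes it easy to see which case a given array size and stride falls into. This is a sketch with hypothetical names. It takes regime 2.a to be the case $1 \leq s \lt b$, which is what "one miss every $b/s$ elements" implies, and it treats any stride outside $1 \leq s \leq N/2$ as invalid.

```c
typedef enum {
    REGIME_1,       /* N <= C: no misses */
    REGIME_2A,      /* N > C, s < b: one miss every b/s elements */
    REGIME_2B,      /* N > C, b <= s < N/a: one miss per element */
    REGIME_2C,      /* N > C, N/a <= s <= N/2: no misses */
    REGIME_INVALID  /* parameters outside the ranges of the table */
} regime_t;

/* N: array size, s: stride, C: cache size, b: line size, a: associativity.
 * All sizes are in the same unit (bytes here). */
static regime_t classify_regime(unsigned long N, unsigned long s,
                                unsigned long C, unsigned long b,
                                unsigned long a)
{
    if (N == 0 || s == 0 || a == 0 || s > N / 2) return REGIME_INVALID;
    if (N <= C) return REGIME_1;
    if (s < b) return REGIME_2A;
    if (s < N / a) return REGIME_2B;
    return REGIME_2C;
}
```

With the Briey parameters ($C = 4096$, $b = 32$, $a = 1$), an 8192-byte array with stride 4 falls into regime 2.a and with stride 32 into regime 2.b. Because $a = 1$ makes $N/a = N$, regime 2.c cannot occur for this direct-mapped cache.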
The [reference](http://wiki.csie.ncku.edu.tw/embedded/arm-linux#context-switch-latency-%E6%B8%AC%E8%A9%A6%E7%90%86%E8%AB%96) first simply uses an array to test the influence of cache misses.
Briey has a 4096-byte cache. Set `configMINIMAL_STACK_SIZE = 128` and use a function to create a given number of tasks.
```c
static void createTask(int num_tasks)
{
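    char task_name[10];
    for(int i = 0; i < num_tasks; i++) {
        /* snprintf guards against overflowing task_name for large indices */
        snprintf(task_name, sizeof(task_name), "Task_%i", i);
        /* Pass the index by value inside the pointer (needs <stdint.h> for intptr_t) */
        xTaskCreate(task, task_name, configMINIMAL_STACK_SIZE * 2U, (void *)(intptr_t)i, tskIDLE_PRIORITY + 1, NULL);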
}
}
```
In the `task` function, use `xTaskGetTickCount()` to read the current tick count.
```c
static void task(void *pvParameters)
{
    /* The task index arrives by value inside the pointer (needs <stdint.h>). */
    int id = (int)(intptr_t)pvParameters;
    for(;;)
    {
        /* start and end are file-scope TickType_t variables shared by all tasks */
        end = xTaskGetTickCount();
        char ee[40];
        char ss[40];
        sprintf(ee, "etime = %lu", (unsigned long)end);
        vSendString(ee);
        char buf[40];
        sprintf(buf, "Hello I'm task %i", id);
        vSendString(buf);
        start = xTaskGetTickCount();
        sprintf(ss, "stime = %lu", (unsigned long)start);
        vSendString(ss);
portYIELD();
}
}
```
```
Hello I'm task 0
stime = 0
etime = 0
Hello I'm task 1
stime = 0
etime = 0
Hello I'm task 2
stime = 0
etime = 0
Hello I'm task 3
stime = 1
etime = 1
Hello I'm task 4
stime = 1
etime = 1
Hello I'm task 5
stime = 1
etime = 1
Hello I'm task 6
stime = 2
etime = 2
Hello I'm task 7
stime = 2
etime = 2
Hello I'm task 8
stime = 2
etime = 2
Hello I'm task 9
stime = 2
etime = 3
Hello I'm task 10
stime = 3
etime = 3
Hello I'm task 11
stime = 3
etime = 3
```
I tried to use `etime - stime` to measure the context switch latency, but it does not work: `stime` and `etime` have the same value most of the time, because the tick counter is far too coarse to resolve a single context switch.
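
A rough calculation shows why the tick count cannot resolve a context switch. The sketch below rests on stated assumptions rather than measurements: a 50 MHz clock (read off the UART divider expression), one tick interrupt every 500 x 1000 = 500000 cycles (from `TIMER_PRESCALER->LIMIT` and `TIMER_A->LIMIT`, ignoring any off-by-one in the counters), hence a 100 Hz tick, and a purely illustrative context switch cost of 1000 cycles. The function names are hypothetical.

```c
#include <stdint.h>

/* CPU cycles that elapse during one RTOS tick. */
static uint32_t cycles_per_tick(uint32_t cpu_hz, uint32_t tick_hz)
{
    return cpu_hz / tick_hz;
}

/* Whole ticks that a tick counter would report for an interval of `cycles`,
 * assuming the interval starts right after a tick boundary. */
static uint32_t ticks_elapsed(uint32_t cycles, uint32_t cpu_hz, uint32_t tick_hz)
{
    return cycles / cycles_per_tick(cpu_hz, tick_hz);
}
```

Under these assumptions one tick spans 500000 cycles, so an interval of 1000 cycles reads as 0 ticks. A cycle-accurate counter, for example the `instanceCycles` count of the simulator, is a better fit for this measurement.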
==**Implementation is still in progress...**==
## Reducing Context Switching Overhead by Processor Architecture Modification
- [Article](https://www.ripublication.com/aeee/20_pp%20155-162.pdf)
The process of context switching involves storing the context (state) of the currently executing task onto the stack and restoring the context of the next task from the stack. The context includes
- Program status word
- Program counter
- Stack pointer
- temporary register values
- data related to the next task
The overhead can be reduced by migrating kernel services such as scheduling, time tick processing, and interrupt handling to **hardware.**
The paper aims to reduce the effect of context switch overhead. One method is to dedicate certain registers to a thread, which eliminates the need to save and restore its context but reduces the number of registers available to other threads.
The paper proposes a method that reduces the overhead by **restricting the use of memory during context switching by adding register files to the processor.** This makes context switching much faster and thereby reduces the overhead.
The implementation involves
1. **Hardware:** Modify processor architecture and hardware implementation of instructions (`scxt, rcxt`)
2. **Software:** Modify GNU MIPS assembler for adding two new instructions.
### Hardware Modification
First, modify the original [Plasma MIPS](https://opencores.org/projects/plasma) design **by implementing all the "reg_bank" registers in the FPGA's logic blocks**, so that the context registers can be saved to a register file in one CPU clock cycle.
Furthermore, 4 additional register files are implemented, each holding 12 context registers:
- 9 saved or temporary registers
- frame pointer register
- global pointer register
- stack pointer register
**Modified MIPS Architecture** (<font color=#ff00>the red part is modified part</font>)

Next, implement two context-switching instructions (`scxt`, `rcxt`) that access these register files to store and restore the context during a context switch.
### Software Modification
First, develop a co-operative operating system with basic functions for
- Initializing the OS
- Creating tasks
- Scheduling the tasks
Next, modify the GCC toolchain so that it is aware of the newly added instructions and can be used to compile the MIPS C files. The instructions are specified in [GNU binutils](https://www.gnu.org/software/binutils/), where the file [`mips-opc.c`](https://chromium.googlesource.com/chromiumos/third_party/gdb/+/refs/heads/master/opcodes/mips-opc.c) contains all the instructions supported by the MIPS processor, for example:
```diff
/* These instructions appear first so that the disassembler will find
them first. The assemblers uses a hash table based on the
instruction name anyhow. */
/* name, args, match, mask, pinfo, pinfo2, membership, ase, exclusions */
{"pref", "k,o(b)", 0xcc000000, 0xfc000000, RD_3|LM, 0, I4_32|G3, 0, 0 },
{"pref", "k,A(b)", 0, (int) M_PREF_AB, INSN_MACRO, 0, I4_32|G3, 0, 0 },
{"prefx", "h,t(b)", 0x4c00000f, 0xfc0007ff, RD_2|RD_3|FP_S|LM, 0, I4_33, 0, 0 },
{"nop", "", 0x00000000, 0xffffffff, 0, INSN2_ALIAS, I1, 0, 0 }, /* sll */
{"ssnop", "", 0x00000040, 0xffffffff, 0, INSN2_ALIAS, I1, 0, 0 }, /* sll */
{"ehb", "", 0x000000c0, 0xffffffff, 0, INSN2_ALIAS, I1, 0, 0 }, /* sll */
{"li", "t,j", 0x24000000, 0xffe00000, WR_1, INSN2_ALIAS, I1, 0, 0 }, /* addiu */
{"li", "t,i", 0x34000000, 0xffe00000, WR_1, INSN2_ALIAS, I1, 0, 0 }, /* ori */
{"li", "t,I", 0, (int) M_LI, INSN_MACRO, 0, I1, 0, 0 },
{"move", "d,s", 0, (int) M_MOVE, INSN_MACRO, 0, I1, 0, 0 },
{"move", "d,s", 0x0000002d, 0xfc1f07ff, WR_1|RD_2, INSN2_ALIAS, I3, 0, 0 },/* daddu */
{"move", "d,s", 0x00000021, 0xfc1f07ff, WR_1|RD_2, INSN2_ALIAS, I1, 0, 0 },/* addu */
{"move", "d,s", 0x00000025, 0xfc1f07ff, WR_1|RD_2, INSN2_ALIAS, I1, 0, 0 },/* or */
{"b", "p", 0x10000000, 0xffff0000, UBD, INSN2_ALIAS, I1, 0, 0 },/* beq 0,0 */
{"b", "p", 0x04010000, 0xffff0000, UBD, INSN2_ALIAS, I1, 0, 0 },/* bgez 0 */
{"bal", "p", 0x04110000, 0xffff0000, WR_31|UBD, INSN2_ALIAS, I1, 0, 0 },/* bgezal 0*/
+{"scxt", "t", 0x0000003C, 0xffffffff, RD_t, 0, I1, 0, 0 },/* scxt */
+{"rcxt", "t", 0x0000003D, 0xffffffff, RD_t, 0, I1, 0, 0 },/* rcxt */
```
This makes the GCC toolchain compatible with the modified architecture.
### Applications & Results (in paper)
The first application contains four tasks created using `createTask()`. The first task increments variables, adds them, stores the result in a sum variable, and finally displays the number of clock cycles.
The second application comprises four tasks: two of them are structured to undergo fast context switching using **internal register files**, and the other two undergo context switching using **external RAM**.
### Conclusion (in paper)
Restricting context switching to the processor itself by modifying the CPU architecture, so that the task context never has to go through external memory, eliminates the extra time consumption.
## Reference
- [2020 Final Project: VexRiscv](https://hackmd.io/@oR8-QX4TQzGKDJ72DmqDUg/S1eKmCbkU)
- [2021 Run FreeRTOS and multitasking on VexRiscv](https://hackmd.io/@oscarshiang/freertos_on_riscv)
- [Final Project:Run FreeRTOS on VexRiscv and access the peripherals such as VGA.](https://hackmd.io/@4a740UnwQE6K9pc5tNlJpg/H1olFPOCD)
- [‘virt’ Generic Virtual Platform (virt)](https://www.qemu.org/docs/master/system/riscv/virt.html)
- [[FreeRTOS] Using FreeRTOS on RISC-V Microcontrollers](https://www.twblogs.net/a/5d40c40cbd9eee51fbf9af02)
- [Using FreeRTOS on RISC-V Microcontrollers](https://www.freertos.org/Using-FreeRTOS-on-RISC-V.html)
- [An Introduction to the RISC-V Architecture](https://cdn2.hubspot.net/hubfs/3020607/An%20Introduction%20to%20the%20RISC-V%20Architecture.pdf)
- [3.10 Options for Debugging Your Program](https://gcc.gnu.org/onlinedocs/gcc/Debugging-Options.html)
- [Linker Script VMA and LMA example](https://github.com/mlouielu/linker_script_vma_lma_example)
- [The linker’s warnings about executable stacks and segments](https://www.redhat.com/en/blog/linkers-warnings-about-executable-stacks-and-segments)
- [spinal](https://javadoc.io/doc/com.github.spinalhdl/spinalhdl-lib_2.11/1.2.2/spinal/package.html)
- [spinal.core](https://javadoc.io/doc/com.github.spinalhdl/spinalhdl-lib_2.11/1.2.2/spinal/core/package.html)
- [spinal.lib](https://javadoc.io/doc/com.github.spinalhdl/spinalhdl-lib_2.11/1.2.2/spinal/lib/package.html)
- [CPU Time](https://chi_gitbook.gitbooks.io/personal-note/content/cpu_time.html)
- [Understanding Hardware Counter CPU Cycles Profiling Metrics](https://docs.oracle.com/cd/E77782_01/html/E77799/gpayk.html)
- [Understanding Cache Contention and Cache Profiling Metrics](https://docs.oracle.com/cd/E77782_01/html/E77799/gpayf.html#scrolltoc)
- [mhz: Anatomy of a micro-benchmark](https://www.usenix.org/legacy/publications/library/proceedings/lisa97/failsafe/usenix98/full_papers/staelin/staelin_html/staelin.html)
- NCKU Wiki
- [ARM-Linux](http://wiki.csie.ncku.edu.tw/embedded/arm-linux#context-switch-latency-%E7%90%86%E8%AB%96%E8%88%87%E5%AF%A6%E9%9A%9B%E7%9A%84%E7%B5%90%E5%90%88)
- [FreeRTOS](http://wiki.csie.ncku.edu.tw/embedded/freertos#%E6%95%88%E8%83%BD%E8%A9%95%E4%BC%B0)
- [Measuring cache and TLB performance and their effect on benchmark runtimes](https://ieeexplore.ieee.org/abstract/document/467697)
- [GDBWave - A Post-Simulation Waveform-Based RISC-V GDB Debugging Server](https://tomverbeure.github.io/2022/02/20/GDBWave-Post-Simulation-RISCV-SW-Debugging.html)
- [FreeRTOS Study](https://hackmd.io/@yyp/Syoh_YmrU)