
Run FreeRTOS on VexRiscv and Measure/Tweak context switch overhead

tags: Computer Architecture

Github : FreeRTOS-on-VexRiscv

Requirement and Expectation

  1. Reproduce Run FreeRTOS and multitasking on VexRiscv with the FreeRTOS 202212.00 release or later
  2. Study the Reference Link to understand how to measure the FreeRTOS context switch
  3. Understand how to accurately measure cycle count and related latency
    • Include the timer interrupt.
  4. Run the above in the simulator and quantify the context switch performance
  5. Study Reducing Context Switching Overhead by Processor Architecture Modification and explain how its instruction-level mechanism reduces the cost of a context switch.

Reproduce past work

  • Step-by-step detail in Note

Rebuild and study the following two projects

Use 3 terminals:

  • Upper right: the Briey SoC
  • Upper left: GDB, connected to the target
  • Lower left: OpenOCD, connected to the Briey SoC

Take the VGA project in VexRiscvSocSoftware as an example.

Run

riscv64-unknown-elf-gdb ../VexRiscv/VexRiscvSocSoftware/projects/briey/vga/build/vga.elf

and type the following commands in GDB

(gdb) target remote localhost:3333
...
(gdb) monitor reset halt
...
(gdb) load
...
(gdb) continue
...

The RGB square then appears in the VGA GUI.

There are other projects for Briey in VexRiscvSocSoftware

  • dhrystone
  • timer
  • uart
  • vga

Rebuild and study Run FreeRTOS and multitasking on VexRiscv

From the description

  • /Source contains the FreeRTOS source code
  • /Demo contains a demo application for every official FreeRTOS port.
  • /Test contains the tests performed on common code and the portable layer code.

Use RISC-V_RV32_QEMU_VIRT_GCC as a template and modify it

The virt board is a platform which does not correspond to any real hardware; it is designed for use in virtual machines. It is the recommended board type if you simply want to run a guest such as Linux and do not care about reproducing the idiosyncrasies and limitations of a particular bit of real-world hardware.

Memory layout

We can find the memory capacity definition in Briey.scala

axiCrossbar.addSlaves(
      ram.io.axi       -> (0x80000000L,   onChipRamSize),
      sdramCtrl.io.axi -> (0x40000000L,   sdramLayout.capacity),
      apbBridge.io.axi -> (0xF0000000L,   1 MB)
    )

  • There are two storage regions
    • ram (on-chip RAM) with base address 0x80000000 and size onChipRamSize (4 kB)
    • sdram with base address 0x40000000 and size sdramLayout.capacity

That is why the linker script (<VexRiscvSocSoftware>/projects/briey/libs/linker.ld) in VexRiscvSocSoftware defines

MEMORY {
  onChipRam (W!RX)/*(RX)*/ : ORIGIN = 0x80000000, LENGTH = 4K
  sdram (W!RX) : ORIGIN = 0x40000000, LENGTH = 64M
}

which shows that

  • The init assembly and the stack are placed in on-chip RAM at 0x80000000 with LENGTH = 4K
  • The text and data sections are placed in SDRAM at 0x40000000 with LENGTH = 64M

Modify and rename fake_rom.lds -> linker.ld

MEMORY
{
	/* Fake ROM area */
-	rom (rxa) : ORIGIN = 0x80000000, LENGTH = 512K
-	ram (wxa) : ORIGIN = 0x80080000, LENGTH = 512K
+	onChipRam (rxa) : ORIGIN = 0x80000000, LENGTH = 4K
+	sdram (wxa) : ORIGIN = 0x40000000, LENGTH = 64M
}

Set up the UART library

Copy the header files from VexRiscvSocSoftware/libs (gpio.h, interrupt.h, prescaler.h, timer.h, uart.h, vga.h) to RISC-V_VexRiscv-Briey_GCC/Vex_libs and merge briey.h (which defines several base address macros) into gpio.h.

Rename riscv-virt to riscv-briey and modify the UART port.

#include <FreeRTOS.h>

#include <string.h>

#include "riscv-briey.h"
- #include "ns16550.h"
+ #include "Vex_libs/uart.h"

And modify the associated code.

void vSendString( const char *s )
{
-	struct device dev;
-	size_t i;
+	size_t i, len = strlen(s);

-	dev.addr = NS16550_ADDR;

	portENTER_CRITICAL();

-	for (i = 0; i < strlen(s); i++) {
-		vOutNS16550( &dev, s[i] );
-	}
-	vOutNS16550( &dev, '\n' );
	
+	for (i = 0; i < len; i++) {
+		uart_write(UART, s[i]);
+	}
+	uart_write(UART, '\n');

	portEXIT_CRITICAL();
}

Implement the Briey timer and integrate the UART and timer

Reference: Use Briey timer instead of machine timer

main.c

int main( void )
{
	int ret;
+	Uart_Config uartConfig;
+	uartConfig.dataLength = 8;
+	uartConfig.parity = NONE;
+	uartConfig.stop = ONE;
+	uartConfig.clockDivider = 50000000/8/115200-1;
+	uart_applyConfig(UART,&uartConfig);

-	// trap handler initialization
-	 	#if( mainVECTOR_MODE_DIRECT == 1 )
-	 {
-	 	__asm__ volatile( "csrw mtvec, %0" :: "r"( freertos_risc_v_trap_handler ) );
-	 }
-	 #else
-	 {
-	 	__asm__ volatile( "csrw mtvec, %0" :: "r"( ( uintptr_t )freertos_vector_table | 0x1 ) );
-	 }
-	 #endif
#if defined(DEMO_BLINKY)
	ret = main_blinky();
#else
#error "Please add or select demo."
#endif

	return ret;
}
  • Reference uart project.
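
The clockDivider expression in main() can be checked with a few lines of C. This is only a sketch: it reuses the expression from the code above (clock / 8 / baud - 1), and the helper names uart_clock_divider and uart_actual_baud are made up for illustration, not part of the Briey library.

```c
#include <stdint.h>

/* Divider value following the expression used in main():
 * clock / 8 / baud - 1. The factor 8 is taken from that expression. */
uint32_t uart_clock_divider(uint32_t clock_hz, uint32_t baud)
{
    return clock_hz / 8u / baud - 1u;
}

/* Baud rate that a given divider yields under the same formula
 * (integer arithmetic, so the result is truncated). */
uint32_t uart_actual_baud(uint32_t clock_hz, uint32_t divider)
{
    return clock_hz / 8u / (divider + 1u);
}
```

With a 50 MHz clock and 115200 baud the divider comes out as 53, and the resulting rate is about 115740 baud, which is within half a percent of the target.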

Implement vPortSetupTimerInterrupt() which is marked as weak in Source/portable/GCC/RISC-V/port.c

  • The frequency of the interrupt must equal the value of configTICK_RATE_HZ (which is defined in FreeRTOSConfig.h)

Implement handle_trap(), which is used to handle timer interrupts. It has to reset the PENDING bit for the timer and increase the system ticks

riscv-briey.c

void handle_trap(void)
{
	portENTER_CRITICAL();

	xTaskIncrementTick();
	TIMER_INTERRUPT->PENDINGS = 1;

	portEXIT_CRITICAL();
}

void vPortSetupTimerInterrupt(void)
{
        asm volatile("li t0, 0x1808\n\t"
		     "csrw mstatus, t0\n\t"
		     "li t0, 0x880\n\t"
		     "csrw mie, t0\n\t");

        interruptCtrl_init(TIMER_INTERRUPT);
        prescaler_init(TIMER_PRESCALER);
        timer_init(TIMER_A);

        TIMER_PRESCALER->LIMIT = 500;

        TIMER_A->LIMIT = 1000;
        TIMER_A->CLEARS_TICKS = 0x00010002;

        TIMER_INTERRUPT->PENDINGS = 0xF;
        TIMER_INTERRUPT->MASKS = 0x1;
}

  • Notice that handle_trap() is selected by -DportasmHANDLE_INTERRUPT=handle_trap in the Makefile; this is an assembler macro, not a compiler macro.
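
To compare the timer settings against configTICK_RATE_HZ, the interrupt rate can be estimated from the two LIMIT values. The sketch below rests on assumptions that are not confirmed in this note: timer A is clocked by the prescaler, each counter wraps after LIMIT + 1 input ticks, and the system clock is the 50 MHz used in the UART divider. The function name is hypothetical.

```c
#include <stdint.h>

/* Estimated timer interrupt rate in Hz.
 * ASSUMPTION: the prescaler and the timer each wrap after LIMIT + 1 ticks,
 * and the timer counts prescaler overflows. */
uint32_t briey_tick_rate_hz(uint32_t clock_hz,
                            uint32_t prescaler_limit,
                            uint32_t timer_limit)
{
    uint64_t cycles_per_tick =
        (uint64_t)(prescaler_limit + 1u) * (uint64_t)(timer_limit + 1u);
    return (uint32_t)(clock_hz / cycles_per_tick);
}
```

Under these assumptions, LIMIT values of 500 and 1000 give a rate just under 100 Hz, and 499 and 999 would give exactly 100 Hz. Whether the off-by-one matters should be checked against the timer hardware description.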

mstatus

  • 0x1808 sets MPP[1:0] = 11, which stands for machine mode, and also sets MIE = 1 (enabled)

mie

  • 0x880 sets MEIE and MTIE to 1, which enables machine-level external interrupts and machine timer interrupts.
  • Reference: Privileged spec
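
The two constants can be rebuilt from named bit fields, which makes the values easier to audit. The bit positions follow the privileged spec; the macro and function names are chosen here for illustration only.

```c
#include <stdint.h>

/* mstatus fields */
#define MSTATUS_MIE   (1u << 3)   /* global machine-mode interrupt enable */
#define MSTATUS_MPP_M (3u << 11)  /* MPP[1:0] = 11: previous privilege is machine mode */

/* mie fields */
#define MIE_MTIE      (1u << 7)   /* machine timer interrupt enable */
#define MIE_MEIE      (1u << 11)  /* machine external interrupt enable */

/* Value written to mstatus in vPortSetupTimerInterrupt(). */
uint32_t mstatus_init_value(void)
{
    return MSTATUS_MPP_M | MSTATUS_MIE;
}

/* Value written to mie in vPortSetupTimerInterrupt(). */
uint32_t mie_init_value(void)
{
    return MIE_MEIE | MIE_MTIE;
}
```

The two functions return 0x1808 and 0x880, matching the literals loaded with li in the inline assembly.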

ABI incompatibility problem

ABI is incompatible with that of the selected emulation:
  target emulation `elf32-littleriscv' does not match `elf64-littleriscv'

  • I found a similar issue, followed the discussion, and figured it out.
    • This problem is already fixed in #838

Memory not fit problem

/opt/riscv/lib/gcc/riscv64-unknown-elf/12.2.0/../../../../riscv64-unknown-elf/bin/ld: build/RTOSDemo.axf section `.text' will not fit in region `onChipRam'
/opt/riscv/lib/gcc/riscv64-unknown-elf/12.2.0/../../../../riscv64-unknown-elf/bin/ld: region `onChipRam' overflowed by 46800 bytes

  • Most of the time the VMA and LMA are the same, but in an embedded system they can differ.
  • onChipRam (4K) does not have enough capacity to hold the code (.text section).
  • Put the .text and .data sections in sdram (64M).

Relocation truncated

build/start.o: in function `.L0 ':
/home/aaron/FreeRTOS-Briey-cs-latency/FreeRTOS/Demo/RISC-V-VexRiscv-Briey_GCC/start.S:68:(.init+0x62): relocation truncated to fit: R_RISCV_JAL against symbol `main' defined in .text.main section in build/main.o
-	jal main
+	call main

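Replacing jal with call solves this problem: jal can only encode a PC-relative offset of about ±1 MiB, while start.S runs from onChipRam (0x80000000) and main is placed in sdram (0x40000000). call expands to an auipc/jalr pair, which can reach it.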
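
The relocation error can be reproduced on paper with a small range check. This is a sketch of the JAL reach rule only (21-bit signed, even offset); the function name is hypothetical, and the addresses in the usage note are illustrative values based on the memory map above.

```c
#include <stdbool.h>
#include <stdint.h>

/* True if a JAL at address pc can encode a jump to target.
 * JAL holds a signed 21-bit offset whose lowest bit is always zero,
 * so it reaches from -1 MiB to +1 MiB - 2 bytes. */
bool jal_can_reach(uint32_t pc, uint32_t target)
{
    int64_t offset = (int64_t)target - (int64_t)pc;
    return (offset % 2 == 0)
        && offset >= -(1 << 20)
        && offset <= (1 << 20) - 2;
}
```

A jump from 0x80000062 to 0x80000094 is in range, while a jump from 0x80000062 to anything in the sdram region at 0x40000000 is about 1 GiB away and cannot be encoded, which is what the linker reports.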

Debugging

Copy <FreeRTOS repo>/FreeRTOS/Demo/Common, which is not provided in the repo, into <FreeRTOS-Briey>/Demo.

"Full" vs "Minimal" demo application files

  • Files in the FreeRTOS/Demo/Common/Full directory assume a hosted environment and are only used by demos that run on top of old DOS systems (which is also why the Partest.c filename is cryptic: it could only use short filenames in 8.3 format).
  • Files in the FreeRTOS/Demo/Common/Minimal directory do not assume a hosted environment.

Under /FreeRTOS

# open GDB console
$ cd <demo_directory>/RISC-V_VexRiscv-Briey_GCC
$ make DEBUG=1 # for debug symbol
$ riscv64-unknown-elf-gdb build/RTOSDemo.axf

Compressed instruction stuck problem

I ran into a problem where gdb gets stuck and execution never reaches main.

The project is compiled with -march=rv32imac, so with compressed instructions each nop shrinks to 2 bytes (the previous work used 7 nop instructions to push main_entry to 0x80000020). I think there is also another problem that makes gdb get stuck (not figured out yet; my guess is that single-stepping in gdb advances pc by 4 at each step).

The li instruction is also compressed to 2 bytes, and under riscv64-unknown-elf-gdb it does not load 0 into a1 properly.

80000028:	f1402573          	csrr	a0,mhartid
8000002c:	4581                	li	a1,0
8000002e:	06b51363          	bne	a0,a1,80000094 <secondary>

As a result, when stepping with gdb, the following bne jumps to the label secondary, which handles the multicore case and should not be taken here.

So I decided to build the project without compressed instructions.

Check multi-lib

riscv64-unknown-elf-gcc -print-multi-lib
.;
rv32i/ilp32;@march=rv32i@mabi=ilp32
rv32im/ilp32;@march=rv32im@mabi=ilp32
rv32iac/ilp32;@march=rv32iac@mabi=ilp32
rv32imac/ilp32;@march=rv32imac@mabi=ilp32
rv32imafc/ilp32f;@march=rv32imafc@mabi=ilp32f
rv64imac/lp64;@march=rv64imac@mabi=lp64

-CFLAGS  = -march=rv32imac -mabi=ilp32 -mcmodel=medany \
+CFLAGS  = -march=rv32im -mabi=ilp32 -mcmodel=medany \
	-Wall \
	-fmessage-length=0 \
	-ffunction-sections \
	-fdata-sections \
	-fno-builtin-printf
LDFLAGS = -nostartfiles -Tlinker.ld \
-	-march=rv32imac -mabi=ilp32 -mcmodel=medany \
+	-march=rv32im -mabi=ilp32 -mcmodel=medany \
	-Xlinker --gc-sections \
	-Xlinker --defsym=__stack_size=300 \
	-Xlinker -Map=RTOSDemo.map

  • With this change gdb runs properly

However, this approach simply gives up compressed instructions.

Configure BrieySoc to support compressed instructions (RVC)

There are some discussions about errors and incompatible features, and some projects with RVC-compatible implementations.

Some similar projects

Following the configurations above, modify IBusCachedPlugin

new IBusCachedPlugin(
  resetVector = 0x80000000l,
  prediction = STATIC,
+  compressedGen = true, // for compressed instruction
  config = InstructionCacheConfig(
	cacheSize = 4096,
	bytePerLine =32,
	wayCount = 1,
	addressWidth = 32,
	cpuDataWidth = 32,
	memDataWidth = 32,
	catchIllegalAccess = true,
	catchAccessFault = true,
	asyncTagMemory = false,
-	twoCycleRam = true,
-	twoCycleCache = true
+	twoCycleRam = false,    // true -> false, for compressed instruction
+	twoCycleCache = false   // true -> false, for compressed instruction
  )

  • After rebuilding the Briey SoC, it runs properly with compressed instructions without modifying the Makefile

Furthermore, remember that executing ecall raises a trap and jumps to the address in mtvec, which is 0x80000020. That is why Oscar added several nop instructions to push main_entry to 0x80000020.

With RVC, however, nop is compressed to 2 bytes, so about twice as many nop instructions are needed to reach the same address.

_init:
	j    _start
	nop
	nop
	nop
	nop
	nop
	nop
	nop
+	nop
+	nop
+	nop
+	nop
+	nop
+	nop
+	nop
+	nop
	
	.globl main_entry
main_entry:
	la a0, freertos_risc_v_trap_handler
	jr a0
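
The nop counts can be double-checked with simple arithmetic. The sketch below assumes that, with RVC enabled, the leading j _start is also emitted in its 2-byte compressed form; this is an inference from the 15 nop lines in the listing, not something verified in the disassembly. The function name is hypothetical.

```c
#include <stdint.h>

/* Number of nop instructions needed so that the code following them
 * starts at target_offset bytes from the beginning of the section. */
uint32_t nops_needed(uint32_t target_offset,
                     uint32_t first_insn_bytes,
                     uint32_t nop_bytes)
{
    return (target_offset - first_insn_bytes) / nop_bytes;
}
```

Without RVC, a 4-byte j plus 7 four-byte nops fills the 0x20 bytes before main_entry. With RVC, a 2-byte j plus 15 two-byte nops fills the same space, which matches the 7 original and 8 added nop lines.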

Linker script LMA problem

portASM.S contains a handler that traps exceptions so that their cause can be inspected.

freertos_risc_v_application_exception_handler:
    csrr t0, mcause     /* For viewing in the debugger only. */
    csrr t1, mepc       /* For viewing in the debugger only */
    csrr t2, mstatus    /* For viewing in the debugger only */
    j .

Run (gdb) info registers to check the mcause value

  • mcause = 0x5, which stands for Load access fault
  • mepc = 0x80000074, which points to 80000074: 00052283 lw t0,0(a0)
  • mstatus = 0x1800

Check the disassembly

80000060:	a4c50513          	addi	a0,a0,-1460 # 80000aa8 <_bss_lma>
80000064:	c000c597          	auipc	a1,0xc000c
80000068:	6a458593          	addi	a1,a1,1700 # 4000c708 <impure_data>
8000006c:	82418613          	addi	a2,gp,-2012 # 4000d0cc <_bss>
80000070:	00c5fc63          	bgeu	a1,a2,80000088 <_start+0x48>
80000074:	00052283          	lw	t0,0(a0)

The assembly source says la a0, _data_lma, but in the disassembly the address resolves to _bss_lma.

Make the LMA of the .data section the same as its VMA.

{
	_data_lma = LOADADDR(.data.start);
- } >sdram AT>onChipRam
+ } >sdram AT>sdram
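
For context, the faulting lw sits in the startup loop that copies initialised data from its load address (LMA) to its run address (VMA). A minimal C sketch of that kind of loop is shown below. It is not the actual start.S code: the function name and the early exit when both addresses match are additions made here for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Copy the initialised-data image word by word from its load address (LMA)
 * to its run address (VMA). Returns the number of words copied. */
size_t copy_data_section(const uint32_t *lma, uint32_t *vma,
                         const uint32_t *vma_end)
{
    size_t words = 0;

    if (lma == vma) {
        return 0; /* LMA == VMA: the data is already in place */
    }
    while (vma < vma_end) {
        *vma++ = *lma++; /* the load from lma is the lw that faulted */
        words++;
    }
    return words;
}
```

With AT>sdram the load and run addresses coincide, so the loop no longer has to read from a region where nothing was loaded.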

Now execution successfully enters main, but it traps to freertos_risc_v_application_exception_handler again.

40000254:	70f020ef          	jal	ra,40003162 <strlen>

riscv64-unknown-elf-gdb --version
GNU gdb (GDB) 10.1
Copyright (C) 2020 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
GNU gdb (GDB) 12.1
Copyright (C) 2022 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

  • Now the string is shown successfully.

The bug was caused by setting CFLAGS with -march=rv32ima and LDFLAGS with -march=rv32imac, both with -mabi=ilp32

  • CFLAGS is used to compile the code
  • LDFLAGS is used to link with the linker script

It was fixed when I used the modified BrieySoc (with compressed instructions).

Illegal instruction problem

Change the CsrPlugin setting in Briey.scala

        new CsrPlugin(
          config = CsrPluginConfig(
            catchIllegalAccess = false,
            mvendorid      = null,
            marchid        = null,
            mimpid         = null,
            mhartid        = null,
            misaExtensionsInit = 66,
            misaAccess     = CsrAccess.NONE,
            mtvecAccess    = CsrAccess.NONE,
            mtvecInit      = 0x80000020l,
            mepcAccess     = CsrAccess.READ_WRITE,
            mscratchGen    = false,
            mcauseAccess   = CsrAccess.READ_ONLY,
            mbadaddrAccess = CsrAccess.READ_ONLY,
            mcycleAccess   = CsrAccess.NONE,
            minstretAccess = CsrAccess.NONE,
-            ecallGen       = false,
+            ecallGen       = true,
            wfiGenAsWait         = false,
            ucycleAccess   = CsrAccess.NONE,
            uinstretAccess = CsrAccess.NONE
          )

And it can successfully enter the task.

Machine timer interrupt

But it still traps in the exception handler

  • The cause is the machine timer interrupt

However, if I use GCC/RISC-V/portASM.S from FreeRTOS Kernel V10.4.6, the context switch runs successfully.

In the previous version, portASM.S used portasmHANDLE_INTERRUPT to handle external interrupts. In the latest version (FreeRTOS Kernel V10.5.1) the code was refactored and portasmHANDLE_INTERRUPT disappeared.

Using FreeRTOS on RISC-V Microcontrollers mentions that building FreeRTOS for a RISC-V core requires you to:

  1. Include the core FreeRTOS source files and the FreeRTOS RISC-V port layer source files in your project.
  2. Ensure the assembler's include path includes the path to the header file that describes any chip specific implementation details.
  3. Define either a constant in FreeRTOSConfig.h or a linker variable to specify the memory to use as the interrupt stack.
  4. Define configMTIME_BASE_ADDRESS and configMTIMECMP_BASE_ADDRESS in FreeRTOSConfig.h.
  5. For the assembler, #define portasmHANDLE_INTERRUPT to the name of the function provided by your chip or tools vendor for handling external interrupts.
  6. Install the FreeRTOS trap handler.

So, I change the file to use external interrupts.

portASM.S

asynchronous_interrupt:
    store_x a1, 0( sp )                 /* Asynchronous interrupt so save unmodified exception return address. */
    load_x sp, xISRStackTop             /* Switch to ISR stack. */
-    j handle_interrupt
+    j portasmHANDLE_INTERRUPT           /* Handle external interrupts */
...
handle_exception:
    /* a0 contains mcause. */
    li t0, 11                                   /* 11 == environment call. */
-    bne a0, t0, application_exception_handler   /* Not an M environment call, so some other exception. */
+    bne a0, t0, portasmHANDLE_INTERRUPT   /* Handle external interrupts */
    call vTaskSwitchContext
    j processed_source

Hello FreeRTOS!
0: Tx: Transfer1
0: Tx: Transfer2
0: Rx: Blink1
0: Tx: Transfer1
0: Rx: Blink2
0: Tx: Transfer2

For some unknown reason, the output is not what I expected. It should look like this

Hello FreeRTOS!
0: Tx: Transfer1
0: Rx: Blink1
0: Tx: Transfer2
0: Rx: Blink2
0: Tx: Transfer1
0: Rx: Blink1
0: Tx: Transfer2
0: Rx: Blink2

which is what the previous work shows.

Note that in the previous work configUSE_PREEMPTION is set to 0, which means cooperative scheduling. So I simply added portYIELD(); at the end of the sender and receiver tasks.

With that, the context switch runs as expected.

Cycle count and Latency in VexRiscv

In VexRiscv.scala

Define a 5-stage pipeline (Fetch, Decode, Execute, Memory, WriteBack)

  def newStage(): Stage = { val s = new Stage; stages += s; s }
  val decode    = newStage()
  val execute   = newStage()
  val memory    = ifGen(config.withMemoryStage)    (newStage())
  val writeBack = ifGen(config.withWriteBackStage) (newStage())

and the Fetch stage can be built with or without RVC

  def withRvc = plugins.find(_.isInstanceOf[IBusFetcher]) match {
    case Some(x) => x.asInstanceOf[IBusFetcher].withRvc
    case None => false
  }

withRvc is declared by the IBusFetcher trait in Services.scala

trait IBusFetcher{
  def haltIt() : Unit
  def incoming() : Bool
  def pcValid(stage : Stage) : Bool
  def getInjectionPort() : Stream[Bits]
  def withRvc() : Boolean
  def forceNoDecode() : Unit
}

Several regression tests also report clock cycle counts.

src/test/cpp/regression/main.cpp shows how the regression tests work.

A ready-valid handshake protocol is implemented

for(SimElement* simElement : simElements) simElement->preCycle();

dump(i + 1);

checks();
//top->eval();
top->clk = 1;
top->eval();

instanceCycles += 1;

for(SimElement* simElement : simElements) simElement->postCycle();

Notice that instanceCycles is incremented between the two loops that call preCycle() and postCycle().

The definitions of these two functions depend on the IBUS configuration. Take IBUS_SIMPLE as an example

virtual void preCycle(){
	if (top->iBus_cmd_valid && top->iBus_cmd_ready) {
		//assertEq(top->iBus_cmd_payload_pc & 3,0);
		pendings[wPtr] = (top->iBus_cmd_payload_pc);
		wPtr = (wPtr + 1) & 0xFF;
		//ws->iBusAccess(top->iBus_cmd_payload_pc,&inst_next,&error_next);
	}
}
//TODO doesn't catch when instruction removed ?
virtual void postCycle(){
	top->iBus_rsp_valid = 0;
	if(rPtr != wPtr && (!ws->iStall || VL_RANDOM_I_WIDTH(7) < 100)){
		uint32_t inst_next;
		bool error_next;
		ws->iBusAccess(pendings[rPtr], &inst_next,&error_next);
		rPtr = (rPtr + 1) & 0xFF;
		top->iBus_rsp_payload_inst = inst_next;
		top->iBus_rsp_valid = 1;
		top->iBus_rsp_payload_error = error_next;
	} else {
		top->iBus_rsp_payload_inst = VL_RANDOM_I_WIDTH(32);
		top->iBus_rsp_payload_error = VL_RANDOM_I_WIDTH(1);
	}
	if(ws->iStall) top->iBus_cmd_ready = VL_RANDOM_I_WIDTH(7) < 100;
}

Each cycle, every simElement->preCycle() is called, and it checks whether iBus_cmd_valid and iBus_cmd_ready are both high, which is when a request is actually transferred.

For example

  • Slow writer(sender) and fast reader(receiver)

  • Fast writer(sender) and slow reader(receiver)
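
Both cases follow the same rule: a transfer happens only in a cycle where valid and ready are high together. A minimal sketch that counts transfers over recorded per-cycle signal values is shown below; the function name and the array representation are illustrative and not taken from main.cpp.

```c
#include <stdbool.h>
#include <stddef.h>

/* Count the cycles in which a ready-valid transfer takes place,
 * given the sampled value of each signal for every cycle. */
size_t count_transfers(const bool *valid, const bool *ready, size_t cycles)
{
    size_t transfers = 0;

    for (size_t i = 0; i < cycles; i++) {
        if (valid[i] && ready[i]) {
            transfers++; /* both sides agree in this cycle */
        }
    }
    return transfers;
}
```

With a slow sender, valid is the limiting signal; with a slow receiver, ready is. In either case the number of transfers equals the number of cycles in which both are high.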

Under /regression, use the command

make clean run IBUS=SIMPLE DBUS=SIMPLE CSR=no MMU=no DEBUG_PLUGIN=no MUL=no DIV=no TRACE=yes

With TRACE=yes, it generates .fst files that can be inspected with GTKWave.

Take rv32ui-p-xor.fst, one of the regression tests, as an example.

  • When valid and ready are both 1, the pc increases by 4 at the next rising clock edge. This confirms that the PC advances in step with the instruction transfers.

And the checks() function

virtual void checks(){
	if(top->VexRiscv->lastStageRegFileWrite_valid == 1 && top->VexRiscv->lastStageRegFileWrite_payload_address != 0){
		assertEq(top->VexRiscv->lastStageRegFileWrite_payload_address, regFileWriteRefArray[regFileWriteRefIndex][0]);
		assertEq(top->VexRiscv->lastStageRegFileWrite_payload_data, regFileWriteRefArray[regFileWriteRefIndex][1]);
		//printf("%d\n",i);

		regFileWriteRefIndex++;
		if(regFileWriteRefIndex == sizeof(regFileWriteRefArray)/sizeof(regFileWriteRefArray[0])){
			pass();
		}
	}
}

  • It checks that the register file write is valid (valid bit = 1) and that the associated payload address is not 0.
  • It then calls assertEq, which throws an exception if its two inputs are not equal
    • payload_address against regFileWriteRefArray[regFileWriteRefIndex][0]
    • payload_data against regFileWriteRefArray[regFileWriteRefIndex][1]

  • These signals can also be observed in GTKWave

If the two functions above pass, instanceCycles += 1;

postCycle() then drives the iBus response signals shown above; whether it is also related to GDB I have not figured out yet.

Timer interrupt

#ifdef TIMER_INTERRUPT
top->timerInterrupt = mTime >= mTimeCmp ? 1 : 0;
//if(mTime == mTimeCmp) printf("SIM timer tick\n");
#endif

mTime increments, and once mTime is greater than or equal to mTimeCmp, timerInterrupt is asserted

  • I have not yet figured out how the timer interrupt affects the cycle count.

Context Switch performance

Based on the Paper and the Reference Link, construct a test method

| Paper | Briey SoC |
| --- | --- |
| C: cache size | 4096 bytes |
| N: array size | (process size) x (process number) |
| s: stride | process size |
| b: line size | 32 bytes |
| a: associativity | 1 way |

The associated definitions can be found in Briey.scala

cpuPlugins = ArrayBuffer(
        new PcManagerSimplePlugin(0x80000000l, false),
        //          new IBusSimplePlugin(
        //            interfaceKeepData = false,
        //            catchAccessFault = true
        //          ),
        new IBusCachedPlugin(
          resetVector = 0x80000000l,
          prediction = STATIC,
          compressedGen = true, // for compressed instruction
          config = InstructionCacheConfig(
            cacheSize = 4096,
            bytePerLine =32,
            wayCount = 1,
            addressWidth = 32,
            cpuDataWidth = 32,
            memDataWidth = 32,
            catchIllegalAccess = true,
            catchAccessFault = true,
            asyncTagMemory = false,
            twoCycleRam = false,    // true -> false, for compressed instruction
            twoCycleCache = false   // true -> false, for compressed instruction
          )
          //            askMemoryTranslation = true,
          //            memoryTranslatorPortConfig = MemoryTranslatorPortConfig(
          //              portTlbSize = 4
          //            )
        ),
        //                    new DBusSimplePlugin(
        //                      catchAddressMisaligned = true,
        //                      catchAccessFault = true
        //                    ),
        new DBusCachedPlugin(
          config = new DataCacheConfig(
            cacheSize         = 4096,
            bytePerLine       = 32,
            wayCount          = 1,
            addressWidth      = 32,
            cpuDataWidth      = 32,
            memDataWidth      = 32,
            catchAccessError  = true,
            catchIllegal      = true,
            catchUnaligned    = true
          ),
          memoryTranslatorPortConfig = null
          //            memoryTranslatorPortConfig = MemoryTranslatorPortConfig(
          //              portTlbSize = 6
          //            )
        ),

The paper also describes several regimes with different size settings, giving the time per iteration with and without cache misses

| Regime | Size of Array | Stride | Frequency of Misses | Time per Iteration |
| --- | --- | --- | --- | --- |
| 1 | \(1 \leq N \leq C\) | \(1 \leq s \leq N/2\) | no misses | \(T_{no-miss}\) |
| 2.a | \(C \lt N\) | \(1 \leq s \lt b\) | one miss every \(b/s\) elements | \(T_{no-miss} + Ds/b\) |
| 2.b | \(C \lt N\) | \(b \leq s \lt N/a\) | one miss every element | \(T_{no-miss} + D\) |
| 2.c | \(C \lt N\) | \(N/a \leq s \leq N/2\) | no misses | \(T_{no-miss}\) |

  • No misses are generated when \(N \leq C\). When \(N \gt C\), the rate of misses is determined by the stride between consecutive elements. \(D\) is the delay penalty.
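
The regimes can be written as one small function, which helps when choosing test sizes. This is a sketch under one stated assumption: regime 2.a is taken to cover \(1 \leq s \lt b\), which is what the "one miss every \(b/s\) elements" entry and the lower bound of regime 2.b imply. The function name is hypothetical, and the values in the usage note for \(T_{no-miss}\) and \(D\) are placeholders, not measurements.

```c
/* Expected time per iteration for array size N, cache size C, stride s,
 * line size b and associativity a (all in bytes or elements, as long as
 * the units are consistent). t_no_miss and D use the same time unit. */
double time_per_iteration(double N, double C, double s, double b, double a,
                          double t_no_miss, double D)
{
    if (N <= C) {
        return t_no_miss;             /* regime 1: the array fits in the cache */
    }
    if (s < b) {
        return t_no_miss + D * s / b; /* regime 2.a: one miss every b/s elements */
    }
    if (s < N / a) {
        return t_no_miss + D;         /* regime 2.b: every element misses */
    }
    return t_no_miss;                 /* regime 2.c: no misses */
}
```

For the Briey cache (C = 4096, b = 32, a = 1), an 8192-byte array with stride 8 falls in regime 2.a and one with stride 64 in regime 2.b. Because a = 1, N/a equals N, so regime 2.c cannot occur on this configuration.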

The reference first simply uses an array to test the influence of cache misses.

Briey has a 4096-byte cache. Set configMINIMAL_STACK_SIZE = 128 and use a function to create a number of tasks.

static void createTask(int num_tasks)
{
	char task_name[10];
	for(int i = 0; i < num_tasks; i++) {
		snprintf(task_name, sizeof(task_name), "Task_%i", i);
		/* Pass the task index through the void * parameter. */
		xTaskCreate(task, task_name, configMINIMAL_STACK_SIZE * 2U, (void *) (intptr_t) i, tskIDLE_PRIORITY + 1, NULL);
	}
}

And in the task handler, use xTaskGetTickCount to find the current tick count.

static void task(void *pvParameters)
{
	char ee[40];
	char ss[40];
	char buf[40];

	for(;;)
	{
		/* start and end are assumed to be file-scope TickType_t variables. */
		end = xTaskGetTickCount();
		sprintf(ee, "etime = %lu", (unsigned long) end);
		vSendString(ee);

		/* The task index was passed in through pvParameters by createTask(). */
		sprintf(buf, "Hello I'm task %i", (int) (intptr_t) pvParameters);
		vSendString(buf);

		start = xTaskGetTickCount();
		sprintf(ss, "stime = %lu", (unsigned long) start);
		vSendString(ss);
		portYIELD();
	}
}

Hello I'm task 0
stime = 0
etime = 0
Hello I'm task 1
stime = 0
etime = 0
Hello I'm task 2
stime = 0
etime = 0
Hello I'm task 3
stime = 1
etime = 1
Hello I'm task 4
stime = 1
etime = 1
Hello I'm task 5
stime = 1
etime = 1
Hello I'm task 6
stime = 2
etime = 2
Hello I'm task 7
stime = 2
etime = 2
Hello I'm task 8
stime = 2
etime = 2
Hello I'm task 9
stime = 2
etime = 3
Hello I'm task 10
stime = 3
etime = 3
Hello I'm task 11
stime = 3
etime = 3

I tried to use etime - stime to measure the context switch latency, but it does not work: stime and etime have the same value most of the time, because the tick count is too coarse to resolve a single context switch.

The implementation is still in progress
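
One possible way around the coarse tick is to read a cycle counter before and after the yield. The sketch below is an assumption-laden outline rather than the method used here: it assumes the mcycle CSR is readable, which is not the case with the CsrPluginConfig shown earlier (mcycleAccess = CsrAccess.NONE) and would require changing that setting. All function names are hypothetical.

```c
#include <stdint.h>

#if defined(__riscv)
/* Read the low 32 bits of the cycle counter.
 * ASSUMPTION: mcycle is implemented and readable on the target. */
static inline uint32_t read_mcycle(void)
{
    uint32_t value;
    __asm__ volatile("csrr %0, mcycle" : "=r"(value));
    return value;
}
#endif

/* Cycles elapsed between two 32-bit counter samples.
 * Unsigned subtraction stays correct across one counter wrap. */
uint32_t elapsed_cycles(uint32_t start, uint32_t end)
{
    return end - start;
}

/* Clock cycles covered by a single RTOS tick. */
uint32_t cycles_per_tick(uint32_t clock_hz, uint32_t tick_rate_hz)
{
    return clock_hz / tick_rate_hz;
}
```

As an illustration of the resolution gap, a 50 MHz clock with a 100 Hz tick (both values are examples) gives 500000 cycles per tick, so anything shorter than that usually shows up as a tick difference of 0.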

Reducing Context Switching Overhead by Processor Architecture Modification

Context switching involves storing the context (state) of the currently executing task onto the stack and restoring the context of the next task to execute from the stack. The context includes

  • Program status word
  • Program counter
  • Stack pointer
  • Temporary register values
  • Data related to the next task

The overhead can be reduced by migrating kernel services such as scheduling, time tick processing, and interrupt handling to hardware.

This paper aims to reduce the effect of context switch overhead. One method is to dedicate certain registers to a thread, which eliminates the need to save and restore the context but reduces the number of registers available to other threads.

This paper provides a method that reduces the overhead by restricting the use of memory during context switching, by adding register files to the processor. This lets the processor switch context at a much faster rate, thereby reducing the overhead.

The implementation involves

  1. Hardware: Modify processor architecture and hardware implementation of instructions (scxt, rcxt)
  2. Software: Modify GNU MIPS assembler for adding two new instructions.

Hardware Modification

First, modify the original Plasma MIPS design by implementing all the "reg_bank" registers in the FPGA's logic blocks, so that the context registers can be saved to a register file in one CPU clock cycle.

Furthermore, 4 additional register files are implemented, each holding 12 context registers.

  • 9 saved or temporary registers
  • frame pointer register
  • global pointer register
  • stack pointer register

Modified MIPS Architecture (the red part is modified part)

Next, implement two context-switching instructions (scxt, rcxt) that access these register files to store and restore the context during a context switch.
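
The effect of the two instructions can be pictured with a small software model. This is a rough sketch only: it assumes the instruction operand selects one of the 4 register files, which this note does not confirm, and the names scxt_model and rcxt_model are invented for illustration. In the hardware the copy happens in one clock cycle instead of a memory loop.

```c
#include <stdint.h>
#include <string.h>

#define CONTEXT_REGS   12 /* 9 saved/temporary registers + fp + gp + sp */
#define REGISTER_FILES 4  /* additional register files in the modified design */

typedef struct {
    uint32_t regs[CONTEXT_REGS];
} context_t;

static context_t register_files[REGISTER_FILES];

/* Model of scxt: save the CPU context into the selected register file.
 * Returns 0 on success, -1 if the file index is out of range. */
int scxt_model(unsigned file, const context_t *cpu)
{
    if (file >= REGISTER_FILES) {
        return -1;
    }
    memcpy(&register_files[file], cpu, sizeof *cpu);
    return 0;
}

/* Model of rcxt: restore the CPU context from the selected register file. */
int rcxt_model(unsigned file, context_t *cpu)
{
    if (file >= REGISTER_FILES) {
        return -1;
    }
    memcpy(cpu, &register_files[file], sizeof *cpu);
    return 0;
}
```

The point of the design is that neither operation touches external RAM, so a task whose context lives in a register file avoids the store and load sequence of a stack-based switch.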

Software Modification

First, develop a cooperative operating system with basic functions for

  • Initializing the OS
  • Creating tasks
  • Scheduling the tasks

Next, modify the GCC toolchain so that it is aware of the newly added instructions; it is then used to compile the MIPS C files. The instructions are specified in GNU binutils, where the file mips-opc.c contains all the instructions supported by the MIPS processor, like this.

/* These instructions appear first so that the disassembler will find
   them first.  The assemblers uses a hash table based on the
   instruction name anyhow.  */
/* name,		args,		match,	    mask,	pinfo,          	pinfo2,		membership,	ase,	exclusions */
{"pref",		"k,o(b)",	0xcc000000, 0xfc000000, RD_3|LM,           	0,		I4_32|G3,	0,	0 },
{"pref",		"k,A(b)",	0,    (int) M_PREF_AB,	INSN_MACRO,		0,		I4_32|G3,	0,	0 },
{"prefx",		"h,t(b)",	0x4c00000f, 0xfc0007ff, RD_2|RD_3|FP_S|LM,		0,		I4_33,		0,	0 },
{"nop",			"",		0x00000000, 0xffffffff, 0,              	INSN2_ALIAS,	I1,		0,	0 }, /* sll */
{"ssnop",		"",		0x00000040, 0xffffffff, 0,              	INSN2_ALIAS,	I1,		0,	0 }, /* sll */
{"ehb",			"",		0x000000c0, 0xffffffff, 0,              	INSN2_ALIAS,	I1,		0,	0 }, /* sll */
{"li",			"t,j",		0x24000000, 0xffe00000, WR_1,			INSN2_ALIAS,	I1,		0,	0 }, /* addiu */
{"li",			"t,i",		0x34000000, 0xffe00000, WR_1,			INSN2_ALIAS,	I1,		0,	0 }, /* ori */
{"li",			"t,I",		0,    (int) M_LI,	INSN_MACRO,		0,		I1,		0,	0 },
{"move",		"d,s",		0,    (int) M_MOVE,	INSN_MACRO,		0,		I1,		0,	0 },
{"move",		"d,s",		0x0000002d, 0xfc1f07ff, WR_1|RD_2,		INSN2_ALIAS,	I3,		0,	0 },/* daddu */
{"move",		"d,s",		0x00000021, 0xfc1f07ff, WR_1|RD_2,		INSN2_ALIAS,	I1,		0,	0 },/* addu */
{"move",		"d,s",		0x00000025, 0xfc1f07ff, WR_1|RD_2,		INSN2_ALIAS,	I1,		0,	0 },/* or */
{"b",			"p",		0x10000000, 0xffff0000, UBD,			INSN2_ALIAS,	I1,		0,	0 },/* beq 0,0 */
{"b",			"p",		0x04010000, 0xffff0000, UBD,			INSN2_ALIAS,	I1,		0,	0 },/* bgez 0 */
{"bal",			"p",		0x04110000, 0xffff0000, WR_31|UBD,		INSN2_ALIAS,	I1,		0,	0 },/* bgezal 0*/
+{"scxt",		"t",		0x0000003C, 0xffffffff, RD_t, 			0		1,		0,	0 },/* scxt */
+{"rcxt",		"t",		0x0000003D, 0xffffffff, RD_t, 			0		1,		0,	0 },/* rcxt */

This makes the GCC toolchain compatible with the modified architecture.

Applications & Results (in paper)

The first application contains four tasks created using createTask(). The first task increments variables and then adds them, stores the result in the sum variable, and finally displays the number of clock cycles.

The second application comprises four tasks: two are structured to undergo fast context switching using the internal register files, and the other two undergo context switching using external RAM.

Conclusion (in paper)

Restricting context switching to the processor itself by modifying the CPU architecture, so that task contexts are not accessed in external memory, eliminates the extra time consumption.

Reference