Assignment1: RISC-V Assembly and Instruction Pipeline

contributed by < dhiptmc >

tags: `RISC-V` `Computer Architecture 2022`

Problem

Decompress Run-Length Encoded List(LeetCode1313)

We are given a list nums of integers representing a list compressed with run-length encoding.

Consider each adjacent pair of elements [freq, val] = [nums[2i], nums[2i+1]] (with i >= 0). For each such pair, there are freq elements with value val concatenated in a sublist. Concatenate all the sublists from left to right to generate the decompressed list.

Return the decompressed list.

Example 1:

Input: nums = [1,2,3,4]
Output: [2,4,4,4]
Explanation: The first pair [1,2] means we have freq = 1 and val = 2 so we generate the array [2].
The second pair [3,4] means we have freq = 3 and val = 4 so we generate [4,4,4].
At the end the concatenation [2] + [4,4,4] is [2,4,4,4].

Example 2:

Input: nums = [1,1,2,3]
Output: [1,3,3]

Solution

Input array and numsSize are known when we start this program.

We first traverse the input array (int i = 0; i < numsSize; i += 2) to find the size of the output array.
Next we collect the element after "freq" so as to get "val".

Note that each "val" needs to be insert into output array for "freq" times.

Implementation

You can find the source code here. Feel free to fork and modify it.

C code
























/**
 * Note: The returned array must be malloced, assume caller calls free().
 */
int *decompressRLElist(int *nums, int numsSize, int *returnSize)
{
    *returnSize = 0;
    for (int i = 0; i < numsSize; i += 2)
    {
        *returnSize += nums[i];
    }

    int *result = malloc(sizeof(int) * (*returnSize));
    int count = 0;

    for (int i = 0; i < numsSize; i += 2)
    {
        for (int j = 0; j < nums[i]; j++)
        {
            result[count++] = nums[i+1];
        }
    }

    return result;
}

Assembly code

.data
nums1: .word 1,2,3,4            #2444
nums2: .word 6,5,1,3,2,1        #555555311  
nums3: .word 1,3,2,4,2,7,1,8    #344778
numsSize1: .word 4
numsSize2: .word 6
numsSize3: .word 8
returnSize1: .word 0
returnSize2: .word 0
returnSize3: .word 0
enter: .string "\n"

.text
main_1:                     # execute first test data
    la   a1, nums1          # get nums[i]     address    a1->nums1
    la   a2, numsSize1      # get numsSize    address    a2->numsSize1
    lw   a2, 0(a2)          # a2 = numsSize
    la   a3, returnSize1    # get returnSize  address    a3->returnSize1
    lw   a3, 0(a3)          # a3 = returnSize
    addi a4, zero, 0        # a4 = i = 0
    #sw   0, 0(a3)          # *returnSize1 = 0
    addi s2, zero, 2        # record number of data set need to be execute
    j    outer_loop

#####loop#####
proc:
    lw   t1, 4(t0)          # nums[i+1]
    slli t2, a6, 2          # calculate result[count] address
    add  t2, t2, s1         # calculate result[count] address //s1 -> result base address
    sw   t1, 0(t2)          # result[count] = nums[i+1];
    addi a6, a6, 1          # count++
    addi a5, a5, 1          # j++ 

inner_loop:
    slli t0, a4, 2          # calculate nums[i] address
    add  t0, t0, a1         # calculate nums[i] address //a1 -> nums base address
    lw   t1, 0(t0)          # load nums[i] to t1  
    blt  a5, t1, proc       # if j < nums[i], jump to proc
    add  a3, a3, t1         # *returnSize += nums[i] ##modified  
    addi a5, zero, 0        # a5 = j = 0    
    addi a4, a4, 2          # i += 2

outer_loop:
    blt  a4, a2, inner_loop # if i < numsSize, jump to inner_loop
    addi a4, zero, 0        # set i = 0
    addi a5, zero, 0        # a5 = j = 0   
    addi a6, zero, 0        # a6 = count = 0
    j    print

#####load different test data#####
main_2:
    la   a1, nums2
    la   a2, numsSize2      # get numsSize2   address    a2->numsSize2
    lw   a2, 0(a2)          # a2 = numsSize2
    la   a3, returnSize2    # get returnSize2 address    a3->returnSize2
    lw   a3, 0(a3)          # a3 = returnSize
    addi a4, zero, 0        # reset i
    j    outer_loop

main_3:
    la   a1, nums3
    la   a2, numsSize3      # get numsSize3   address    a2->numsSize3
    lw   a2, 0(a2)          # a2 = numsSize3
    la   a3, returnSize3    # get returnSize3 address    a3->returnSize3
    lw   a3, 0(a3)          # a3 = returnSize
    addi a4, zero, 0        # reset i
    j    outer_loop

#####print result#####
print:
    slli t2, a4, 2          # calculate result[i] address
    add  t2, t2 ,s1         # calculate result[i] address //s1 -> result base address
    lw   a0, 0(t2)          # load result[i] to a0
    li   a7, 1              # system call print
    ecall                   # execute system call
    
    addi a4, a4, 1          # i++
    blt  a4, a3, print      # if i < returnSize, jump to print
    la   a0, enter          # print \n
    li   a7, 4              # a7 = 4 means print string
    ecall
    
    addi s2, s2, -1 
    bgtz s2,main_2          # if s2 >  0 jump to main_2
    beqz s2,main_3          # if s2 == 0 jump to main_3
    j    exit

exit:
    li   a7, 10             # system call exit
    ecall                   # execute system call

:warning: Can you use fewer instructions?
:notes: jserv

Indeed, there are possible ways to use fewer instructions.
Modified version of Assembly code is shown below.

.data
nums: .word 1,2,3,4, 6,5,1,3,2,1, 1,3,2,4,2,7,1,8
numsSize: .word 4,6,8
returnSize: .word 0,0,0
enter: .string "\n"
result: .word 0,0,0,0,0,0,0,0,0,0 ##used to avoid EXCEEDING boundaries

#####function starts#####
.text
initial:
    addi t5, zero, 3        # record number of testing datas need to be executed
    addi s4, zero, 0        # identify which testing data is executing
    la   a1, nums           # get nums[i]     address    a1 -> nums
    la   a2, numsSize       # get numsSize    address    a2 -> numsSize
    la   a3, returnSize     # get returnSize  address    a3 -> returnSize

main:   
    lw   s3, 0(a3)          # s3 = returnSize
    lw   s2, 0(a2)          # s2 = numsSize
    lw   a4, result         # base address of result
    lw   t4, result         # the desired result address
    j    loop

#####loop#####
proc:
    sw   t2, 0(t4)          # result[count] = nums[i+1]
    addi t4, t4, 4          # t4 -> result[count+1]
    addi a6, a6, 1          # j++
    blt  a6, t1, proc       # if j < nums[i], jump back to proc
    add  s3, s3, t1         # *returnSize += nums[i]
    addi a6, zero, 0        # a6 = j = 0
    addi a5, a5, 2          # i += 2
    addi a1, a1, 8          # a1 -> nums[i+2]

loop:
    lw   t1, 0(a1)          # t1 = nums[i]
    lw   t2, 4(a1)          # t2 = nums[i+1]
    bge  a5, s2, print      # if i >= numsSize, jump to print
    blt  a6, t1, proc       # if j  < nums[i] , jump to proc

#####print result and initialize next test set#####
print:
    lw   a0, 0(a4)          # load result[i] to a0
    li   a7, 1              # system call print
    ecall                   # execute system call
    
    addi a4, a4, 4          # move to next result word
    bne  a4, t4, print      # if a4 != final address of t4, jump to print
    la   a0, enter          # print \n
    li   a7, 4              # a7 = 4 means print string
    ecall
    
    ##initialize
    addi a5, zero, 0        # a5 = i = 0

    addi s4, s4, 1          # move to the next testing data
    addi a2, a2, 4          # move to the next numsSize
    addi a3, a3, 4          # move to the next returnSize
    blt  s4, t5, main       # if s4 < 3 jump to main
    j    exit

exit:
    li   a7, 10             # system call exit
    ecall                   # execute system call

As a result, clock cycle as the figure shown below decreases compared to the original work.

Origin:

After modified:

Analysis

We test our code using Ripes simulator

Results

Test data 1

Input: nums = [1,2,3,4]
Output: [244]

Test data 2

Input: nums = [6,5,1,3,2,1]
Output: [555555311]

Test data 3

Input: nums = [1,3,2,4,2,7,1,8]
Output: [344778]

Ripes print

5-stage pipelined processor

IF: instruction fetch stage

Fetch instruction from memory.
At this example, lw x12 0 x12 is in IF stage.

Current PC(Program counter) is 0x00000010, which leads to an instruction decoded as 0x00062603.
We can find out that there is a simple ALU that adds 4 to the current PC which eventually returns the next PC.
Meanwhile, current PC and PC+4 will be delivered to IFID for later use.

Noted that there is also a possibility that PC(PC Unit) receives a value from EXE stage.

ID: instruction decode/register file read stage

Read registers while decoding the instruction.
At this example, addi x12 x12 64 is in ID stage.

Decode(Decode Unit) outputs the x12 address(0x0c) for the Registers, and Registers outputs the corresponding value stored in x12 to IDEX. In this circumstance, Reg2 is unnecessary for addi operation, so we can ignore it.
Imm. unit(Immediate Generator) outputs the immediaite value, in this case, 0x00000040(32'd64) will be delivered to IDEX for later use.

Noted that target register (x12) address for storing back to the Registers, PC, and PC+4 are also delivered to IDEX.

EXE: execution stage

Execute the operation or calculate an address.
At this example, auipc x12 0x10000 is in EXE stage.

EXE stage is somewhat complicated. 4 MUXs(multiplexer) can be found.
The left two MUXs choose the source of the input registers, which can be directly come from IDEX (Reg1/Reg2) or from two different forwarding paths.
The right two MUXs choose whether the Op source comes from PC, IMM. unit or register. In this case, Op1 comes from PC, and Op2 comes from IMM. unit.
ALU unit eventually finsihes the arithmetic operation and outputs ths result to EXMEM, and outputs the target PC address to IF stage in some branch instructions.

In addition, a Branch unit takes value of two different registers in order to perform branch control, which splits the traditional operation from the ALU unit so as to alleviate the workload for ALU unit, and may have better performance for branch operations.

Noted that target register, one register input, and PC+4 is also delivered to EX/MEM.

MEM: memory access stage

Access an operand in data memory.
At this example, addi x11 x11 0 is in MEM stage.

Data memory unit recieves the Addr. and Data in signals to perform memory read/write. In this circumstance, there is no need to use Data memory unit, so we can simply ignore the Read out result.

In addition, result from the previous EXE stage can be forwarded to the current EXE stage.

WB: write-back stage

Write the result into a register.
At this example, auipc x11 0x10000 is in WB stage.

MUX chooses the input signal from PC+4, Data memory unit, or result from the EXE stage.
In this circumstance, we choose the result from the EXE stage(0x10000000), and stores the value to the register whose target address is 0x0b.

In addition, output signal from the MUX can be forwarded to the EXE stage.

Memory update steps

Memory condition when program starts

From the Instruction memory list, we can discover that the next few instructions has been loaded into the pipeline.
For instance, lw x12 0 x12 is in IF stage.

Memory updates during main_1

a1 gets the base address of nums1.
a2 gets the numsSize1.
a3 = returnSize which initials as 0 for later use.
a4 = i = 0 for later branch deciding.
s2 = 2 records numbers of testing sets need to be execute.

Memory updates during proc and inner_loop

t0 gets the nums1[i] address.
t1 first loads nums[i], and is used to decide whether to branch to proc or not.

Then, t1 loads nums[i+1], t2 gets result[count] address, and nums[i+1] is stored in "result[count] address".
Noted that we shift the index 2 bits so as to match the memory address.

Eventually, we make a5 += 1(j++), a6 += 1(count++).
t1 reloads nums[i], and we check the relationship between the value in a5 and t1, which leads us to decide whether go backs to proc or not.
If we don't need to branch to proc again, we add t1(nums[i]) to a3(returnSize), set a5(j) to 0, and add 2 to a4(i).

Memory updates during outer_loop

We reset a4(i), a5(j), a6(count) to zero, prepare for later use.

Memory updates after print

s2 value is decreased by 1.

Hazard

Data Hazard

Data hazards occur when instructions that exhibit data dependence modify data in different stages of a pipeline. Ignoring potential data hazards can result in race conditions (also termed race hazards). There are three situations in which a data hazard can occur:

read after write (RAW), a true dependency
write after read (WAR), an anti-dependency
write after write (WAW), an output dependency
Read after read (RAR) is not a hazard case.
- Solution
1. Forwarding
  
  By forwarding, we can use the result from MEM stage or WB stage immediately.
  As the picture shown above, the current instruction in EXE stage is addi x11 x11 0, while the previous instruction is auipc x11 0x10000.
  Therefore, we pass the value from EX/MEM to EXE stage as the yellow line shows in order to solve the data hazard.
2. Stall when Load-Use
  
  One case where forwarding cannot save the day is when an instruction tries to read a register following a load instruction that writes the same register. The picture shown above illustrates the problem. In this situation, a stall is needed.

Control Hazard

Control hazard occurs when the pipeline makes wrong decisions on branch prediction and therefore brings instructions into the pipeline that must subsequently be discarded. The term branch hazard also refers to a control hazard.

Solution:

Control hazard occurs when the branch is taken. Hence, we need to flush the sequential instructions that follow the branch. As a matter of fact, we will waste 2 cycles since branch is dicided at EXE stage.

One way to improve branch performance is to reduce the cost of the taken branch. Moving the branch execution to the ID stage will be a wise choice, since it reduces the penalty of a branch to only one instruction if the branch is taken, namely, the one currently being fetched. Despite there may be several difficulties to conquer (several new units and mechanisms needed), it is still an improvement to the performance.

David A. PATTERSON, JOHN L. HENNESSY. Computer Organization and Design 4/e, page 379 FIGURE 4.62.

Pseudo instruction