Try   HackMD

Assignment1: RISC-V Assembly and Instruction Pipeline

contributed by < dhiptmc >

tags: RISC-V Computer Architecture 2022

Problem

Decompress Run-Length Encoded List(LeetCode1313)

We are given a list nums of integers representing a list compressed with run-length encoding.

Consider each adjacent pair of elements [freq, val] = [nums[2i], nums[2i+1]] (with i >= 0). For each such pair, there are freq elements with value val concatenated in a sublist. Concatenate all the sublists from left to right to generate the decompressed list.

Return the decompressed list.

Example 1:

Input: nums = [1,2,3,4]
Output: [2,4,4,4]
Explanation: The first pair [1,2] means we have freq = 1 and val = 2 so we generate the array [2].
The second pair [3,4] means we have freq = 3 and val = 4 so we generate [4,4,4].
At the end the concatenation [2] + [4,4,4] is [2,4,4,4].

Example 2:

Input: nums = [1,1,2,3]
Output: [1,3,3]

Solution

Input array and numsSize are known when we start this program.

We first traverse the input array (int i = 0; i < numsSize; i += 2) to find the size of the output array.
Next we collect the element after "freq" so as to get "val".

Note that each "val" needs to be insert into output array for "freq" times.

Implementation

You can find the source code here. Feel free to fork and modify it.

C code

/** * Note: The returned array must be malloced, assume caller calls free(). */ int *decompressRLElist(int *nums, int numsSize, int *returnSize) { *returnSize = 0; for (int i = 0; i < numsSize; i += 2) { *returnSize += nums[i]; } int *result = malloc(sizeof(int) * (*returnSize)); int count = 0; for (int i = 0; i < numsSize; i += 2) { for (int j = 0; j < nums[i]; j++) { result[count++] = nums[i+1]; } } return result; }

Assembly code

.data
nums1: .word 1,2,3,4            #2444
nums2: .word 6,5,1,3,2,1        #555555311  
nums3: .word 1,3,2,4,2,7,1,8    #344778
numsSize1: .word 4
numsSize2: .word 6
numsSize3: .word 8
returnSize1: .word 0
returnSize2: .word 0
returnSize3: .word 0
enter: .string "\n"

.text
main_1:                     # execute first test data
    la   a1, nums1          # get nums[i]     address    a1->nums1
    la   a2, numsSize1      # get numsSize    address    a2->numsSize1
    lw   a2, 0(a2)          # a2 = numsSize
    la   a3, returnSize1    # get returnSize  address    a3->returnSize1
    lw   a3, 0(a3)          # a3 = returnSize
    addi a4, zero, 0        # a4 = i = 0
    #sw   0, 0(a3)          # *returnSize1 = 0
    addi s2, zero, 2        # record number of data set need to be execute
    j    outer_loop

#####loop#####
proc:
    lw   t1, 4(t0)          # nums[i+1]
    slli t2, a6, 2          # calculate result[count] address
    add  t2, t2, s1         # calculate result[count] address //s1 -> result base address
    sw   t1, 0(t2)          # result[count] = nums[i+1];
    addi a6, a6, 1          # count++
    addi a5, a5, 1          # j++ 

inner_loop:
    slli t0, a4, 2          # calculate nums[i] address
    add  t0, t0, a1         # calculate nums[i] address //a1 -> nums base address
    lw   t1, 0(t0)          # load nums[i] to t1  
    blt  a5, t1, proc       # if j < nums[i], jump to proc
    add  a3, a3, t1         # *returnSize += nums[i] ##modified  
    addi a5, zero, 0        # a5 = j = 0    
    addi a4, a4, 2          # i += 2

outer_loop:
    blt  a4, a2, inner_loop # if i < numsSize, jump to inner_loop
    addi a4, zero, 0        # set i = 0
    addi a5, zero, 0        # a5 = j = 0   
    addi a6, zero, 0        # a6 = count = 0
    j    print

#####load different test data#####
main_2:
    la   a1, nums2
    la   a2, numsSize2      # get numsSize2   address    a2->numsSize2
    lw   a2, 0(a2)          # a2 = numsSize2
    la   a3, returnSize2    # get returnSize2 address    a3->returnSize2
    lw   a3, 0(a3)          # a3 = returnSize
    addi a4, zero, 0        # reset i
    j    outer_loop

main_3:
    la   a1, nums3
    la   a2, numsSize3      # get numsSize3   address    a2->numsSize3
    lw   a2, 0(a2)          # a2 = numsSize3
    la   a3, returnSize3    # get returnSize3 address    a3->returnSize3
    lw   a3, 0(a3)          # a3 = returnSize
    addi a4, zero, 0        # reset i
    j    outer_loop

#####print result#####
print:
    slli t2, a4, 2          # calculate result[i] address
    add  t2, t2 ,s1         # calculate result[i] address //s1 -> result base address
    lw   a0, 0(t2)          # load result[i] to a0
    li   a7, 1              # system call print
    ecall                   # execute system call
    
    addi a4, a4, 1          # i++
    blt  a4, a3, print      # if i < returnSize, jump to print
    la   a0, enter          # print \n
    li   a7, 4              # a7 = 4 means print string
    ecall
    
    addi s2, s2, -1 
    bgtz s2,main_2          # if s2 >  0 jump to main_2
    beqz s2,main_3          # if s2 == 0 jump to main_3
    j    exit

exit:
    li   a7, 10             # system call exit
    ecall                   # execute system call

:warning: Can you use fewer instructions?
:notes: jserv

Indeed, there are possible ways to use fewer instructions.
Modified version of Assembly code is shown below.

.data
nums: .word 1,2,3,4, 6,5,1,3,2,1, 1,3,2,4,2,7,1,8
numsSize: .word 4,6,8
returnSize: .word 0,0,0
enter: .string "\n"
result: .word 0,0,0,0,0,0,0,0,0,0 ##used to avoid EXCEEDING boundaries

#####function starts#####
.text
initial:
    addi t5, zero, 3        # record number of testing datas need to be executed
    addi s4, zero, 0        # identify which testing data is executing
    la   a1, nums           # get nums[i]     address    a1 -> nums
    la   a2, numsSize       # get numsSize    address    a2 -> numsSize
    la   a3, returnSize     # get returnSize  address    a3 -> returnSize

main:   
    lw   s3, 0(a3)          # s3 = returnSize
    lw   s2, 0(a2)          # s2 = numsSize
    lw   a4, result         # base address of result
    lw   t4, result         # the desired result address
    j    loop

#####loop#####
proc:
    sw   t2, 0(t4)          # result[count] = nums[i+1]
    addi t4, t4, 4          # t4 -> result[count+1]
    addi a6, a6, 1          # j++
    blt  a6, t1, proc       # if j < nums[i], jump back to proc
    add  s3, s3, t1         # *returnSize += nums[i]
    addi a6, zero, 0        # a6 = j = 0
    addi a5, a5, 2          # i += 2
    addi a1, a1, 8          # a1 -> nums[i+2]

loop:
    lw   t1, 0(a1)          # t1 = nums[i]
    lw   t2, 4(a1)          # t2 = nums[i+1]
    bge  a5, s2, print      # if i >= numsSize, jump to print
    blt  a6, t1, proc       # if j  < nums[i] , jump to proc

#####print result and initialize next test set#####
print:
    lw   a0, 0(a4)          # load result[i] to a0
    li   a7, 1              # system call print
    ecall                   # execute system call
    
    addi a4, a4, 4          # move to next result word
    bne  a4, t4, print      # if a4 != final address of t4, jump to print
    la   a0, enter          # print \n
    li   a7, 4              # a7 = 4 means print string
    ecall
    
    ##initialize
    addi a5, zero, 0        # a5 = i = 0

    addi s4, s4, 1          # move to the next testing data
    addi a2, a2, 4          # move to the next numsSize
    addi a3, a3, 4          # move to the next returnSize
    blt  s4, t5, main       # if s4 < 3 jump to main
    j    exit

exit:
    li   a7, 10             # system call exit
    ecall                   # execute system call

As a result, clock cycle as the figure shown below decreases compared to the original work.

Origin:

After modified:

Analysis

We test our code using Ripes simulator

Results

  • Test data 1
Input: nums = [1,2,3,4]
Output: [244]
  • Test data 2
Input: nums = [6,5,1,3,2,1]
Output: [555555311]
  • Test data 3
Input: nums = [1,3,2,4,2,7,1,8]
Output: [344778]

Ripes print

5-stage pipelined processor

IF: instruction fetch stage


Fetch instruction from memory.
At this example, lw x12 0 x12 is in IF stage.

Current PC(Program counter) is 0x00000010, which leads to an instruction decoded as 0x00062603.
We can find out that there is a simple ALU that adds 4 to the current PC which eventually returns the next PC.
Meanwhile, current PC and PC+4 will be delivered to IFID for later use.

Noted that there is also a possibility that PC(PC Unit) receives a value from EXE stage.

ID: instruction decode/register file read stage


Read registers while decoding the instruction.
At this example, addi x12 x12 64 is in ID stage.

Decode(Decode Unit) outputs the x12 address(0x0c) for the Registers, and Registers outputs the corresponding value stored in x12 to IDEX. In this circumstance, Reg2 is unnecessary for addi operation, so we can ignore it.
Imm. unit(Immediate Generator) outputs the immediaite value, in this case, 0x00000040(32'd64) will be delivered to IDEX for later use.

Noted that target register (x12) address for storing back to the Registers, PC, and PC+4 are also delivered to IDEX.

EXE: execution stage


Execute the operation or calculate an address.
At this example, auipc x12 0x10000 is in EXE stage.

EXE stage is somewhat complicated. 4 MUXs(multiplexer) can be found.
The left two MUXs choose the source of the input registers, which can be directly come from IDEX (Reg1/Reg2) or from two different forwarding paths.
The right two MUXs choose whether the Op source comes from PC, IMM. unit or register. In this case, Op1 comes from PC, and Op2 comes from IMM. unit.
ALU unit eventually finsihes the arithmetic operation and outputs ths result to EXMEM, and outputs the target PC address to IF stage in some branch instructions.

In addition, a Branch unit takes value of two different registers in order to perform branch control, which splits the traditional operation from the ALU unit so as to alleviate the workload for ALU unit, and may have better performance for branch operations.

Noted that target register, one register input, and PC+4 is also delivered to EX/MEM.

MEM: memory access stage


Access an operand in data memory.
At this example, addi x11 x11 0 is in MEM stage.

Data memory unit recieves the Addr. and Data in signals to perform memory read/write. In this circumstance, there is no need to use Data memory unit, so we can simply ignore the Read out result.

In addition, result from the previous EXE stage can be forwarded to the current EXE stage.

WB: write-back stage


Write the result into a register.
At this example, auipc x11 0x10000 is in WB stage.

MUX chooses the input signal from PC+4, Data memory unit, or result from the EXE stage.
In this circumstance, we choose the result from the EXE stage(0x10000000), and stores the value to the register whose target address is 0x0b.

In addition, output signal from the MUX can be forwarded to the EXE stage.

Memory update steps

Memory condition when program starts

From the Instruction memory list, we can discover that the next few instructions has been loaded into the pipeline.
For instance, lw x12 0 x12 is in IF stage.

Memory updates during main_1

a1 gets the base address of nums1.
a2 gets the numsSize1.
a3 = returnSize which initials as 0 for later use.
a4 = i = 0 for later branch deciding.
s2 = 2 records numbers of testing sets need to be execute.

Memory updates during proc and inner_loop

t0 gets the nums1[i] address.
t1 first loads nums[i], and is used to decide whether to branch to proc or not.

Then, t1 loads nums[i+1], t2 gets result[count] address, and nums[i+1] is stored in "result[count] address".
Noted that we shift the index 2 bits so as to match the memory address.

Eventually, we make a5 += 1(j++), a6 += 1(count++).
t1 reloads nums[i], and we check the relationship between the value in a5 and t1, which leads us to decide whether go backs to proc or not.
If we don't need to branch to proc again, we add t1(nums[i]) to a3(returnSize), set a5(j) to 0, and add 2 to a4(i).

Memory updates during outer_loop

We reset a4(i), a5(j), a6(count) to zero, prepare for later use.

Memory updates after print

s2 value is decreased by 1.

Hazard

Data Hazard

Data hazards occur when instructions that exhibit data dependence modify data in different stages of a pipeline. Ignoring potential data hazards can result in race conditions (also termed race hazards). There are three situations in which a data hazard can occur:

  • read after write (RAW), a true dependency

  • write after read (WAR), an anti-dependency

  • write after write (WAW), an output dependency
    Read after read (RAR) is not a hazard case.

    • Solution
    1. Forwarding

      By forwarding, we can use the result from MEM stage or WB stage immediately.
      As the picture shown above, the current instruction in EXE stage is addi x11 x11 0, while the previous instruction is auipc x11 0x10000.
      Therefore, we pass the value from EX/MEM to EXE stage as the yellow line shows in order to solve the data hazard.
    2. Stall when Load-Use

      One case where forwarding cannot save the day is when an instruction tries to read a register following a load instruction that writes the same register. The picture shown above illustrates the problem. In this situation, a stall is needed.

Control Hazard

Control hazard occurs when the pipeline makes wrong decisions on branch prediction and therefore brings instructions into the pipeline that must subsequently be discarded. The term branch hazard also refers to a control hazard.

  • Solution:

    Control hazard occurs when the branch is taken. Hence, we need to flush the sequential instructions that follow the branch. As a matter of fact, we will waste 2 cycles since branch is dicided at EXE stage.

One way to improve branch performance is to reduce the cost of the taken branch. Moving the branch execution to the ID stage will be a wise choice, since it reduces the penalty of a branch to only one instruction if the branch is taken, namely, the one currently being fetched. Despite there may be several difficulties to conquer (several new units and mechanisms needed), it is still an improvement to the performance.

David A. PATTERSON, JOHN L. HENNESSY. Computer Organization and Design 4/e, page 379 FIGURE 4.62.

Pseudo instruction

Origin:

00000000 <main_1>:
    0:         10000597        auipc x11 0x10000
    4:         00058593        addi x11 x11 0
    8:         10000617        auipc x12 0x10000
    c:         04060613        addi x12 x12 64
    10:        00062603        lw x12 0 x12
    14:        10000697        auipc x13 0x10000
    18:        04068693        addi x13 x13 64
    1c:        0006a683        lw x13 0 x13
    20:        00000713        addi x14 x0 0
    24:        00200913        addi x18 x0 2
    28:        0380006f        jal x0 56 <outer_loop>

0000002c <proc>:
    2c:        0042a303        lw x6 4 x5
    30:        00281393        slli x7 x16 2
    34:        009383b3        add x7 x7 x9
    38:        0063a023        sw x6 0 x7
    3c:        00180813        addi x16 x16 1
    40:        00178793        addi x15 x15 1

00000044 <inner_loop>:
    44:        00271293        slli x5 x14 2
    48:        00b282b3        add x5 x5 x11
    4c:        0002a303        lw x6 0 x5
    50:        fc67cee3        blt x15 x6 -36 <proc>
    54:        006686b3        add x13 x13 x6
    58:        00000793        addi x15 x0 0
    5c:        00270713        addi x14 x14 2

00000060 <outer_loop>:
    60:        fec742e3        blt x14 x12 -28 <inner_loop>
    64:        00000713        addi x14 x0 0
    68:        00000793        addi x15 x0 0
    6c:        00000813        addi x16 x0 0
    70:        0540006f        jal x0 84 <print>

00000074 <main_2>:
    74:        10000597        auipc x11 0x10000
    78:        f9c58593        addi x11 x11 -100
    7c:        10000617        auipc x12 0x10000
    80:        fd060613        addi x12 x12 -48
    84:        00062603        lw x12 0 x12
    88:        10000697        auipc x13 0x10000
    8c:        fd068693        addi x13 x13 -48
    90:        0006a683        lw x13 0 x13
    94:        00000713        addi x14 x0 0
    98:        fc9ff06f        jal x0 -56 <outer_loop>

0000009c <main_3>:
    9c:        10000597        auipc x11 0x10000
    a0:        f8c58593        addi x11 x11 -116
    a4:        10000617        auipc x12 0x10000
    a8:        fac60613        addi x12 x12 -84
    ac:        00062603        lw x12 0 x12
    b0:        10000697        auipc x13 0x10000
    b4:        fac68693        addi x13 x13 -84
    b8:        0006a683        lw x13 0 x13
    bc:        00000713        addi x14 x0 0
    c0:        fa1ff06f        jal x0 -96 <outer_loop>

000000c4 <print>:
    c4:        00271393        slli x7 x14 2
    c8:        009383b3        add x7 x7 x9
    cc:        0003a503        lw x10 0 x7
    d0:        00100893        addi x17 x0 1
    d4:        00000073        ecall
    d8:        00170713        addi x14 x14 1
    dc:        fed744e3        blt x14 x13 -24 <print>
    e0:        10000517        auipc x10 0x10000
    e4:        f8050513        addi x10 x10 -128
    e8:        00400893        addi x17 x0 4
    ec:        00000073        ecall
    f0:        fff90913        addi x18 x18 -1
    f4:        f92040e3        blt x0 x18 -128 <main_2>
    f8:        fa0902e3        beq x18 x0 -92 <main_3>
    fc:        0040006f        jal x0 4 <exit>

00000100 <exit>:
    100:       00a00893        addi x17 x0 10
    104:       00000073        ecall

Modified:

00000000 <initial>:
    0:         00300f13        addi x30 x0 3
    4:         00000a13        addi x20 x0 0
    8:         10000597        auipc x11 0x10000
    c:         ff858593        addi x11 x11 -8
    10:        10000617        auipc x12 0x10000
    14:        03860613        addi x12 x12 56
    18:        10000697        auipc x13 0x10000
    1c:        03c68693        addi x13 x13 60

00000020 <main>:
    20:        0006a983        lw x19 0 x13
    24:        00062903        lw x18 0 x12
    28:        10000717        auipc x14 0x10000
    2c:        03a72703        lw x14 58 x14
    30:        10000e97        auipc x29 0x10000
    34:        032eae83        lw x29 50 x29
    38:        0240006f        jal x0 36 <loop>

0000003c <proc>:
    3c:        007ea023        sw x7 0 x29
    40:        004e8e93        addi x29 x29 4
    44:        00180813        addi x16 x16 1
    48:        fe684ae3        blt x16 x6 -12 <proc>
    4c:        006989b3        add x19 x19 x6
    50:        00000813        addi x16 x0 0
    54:        00278793        addi x15 x15 2
    58:        00858593        addi x11 x11 8

0000005c <loop>:
    5c:        0005a303        lw x6 0 x11
    60:        0045a383        lw x7 4 x11
    64:        0127d463        bge x15 x18 8 <print>
    68:        fc684ae3        blt x16 x6 -44 <proc>

0000006c <print>:
    6c:        00072503        lw x10 0 x14
    70:        00100893        addi x17 x0 1
    74:        00000073        ecall
    78:        00470713        addi x14 x14 4
    7c:        ffd718e3        bne x14 x29 -16 <print>
    80:        10000517        auipc x10 0x10000
    84:        fe050513        addi x10 x10 -32
    88:        00400893        addi x17 x0 4
    8c:        00000073        ecall
    90:        00000793        addi x15 x0 0
    94:        001a0a13        addi x20 x20 1
    98:        00460613        addi x12 x12 4
    9c:        00468693        addi x13 x13 4
    a0:        f9ea40e3        blt x20 x30 -128 <main>
    a4:        0040006f        jal x0 4 <exit>

000000a8 <exit>:
    a8:        00a00893        addi x17 x0 10
    ac:        00000073        ecall

Reference