Assignment1: RISC-V Assembly and Instruction Pipeline

contributed by < teimeikichengmingchi >

tags: `RISC-V`, `jserv`

Remove Duplicates from Sorted Array (LeetCode 26)

This is one of the "Easy" questions on LeetCode as requested in the assignment. The question is as follows.

Given an integer array nums sorted in non-decreasing order, remove the duplicates in-place such that each unique element appears only once. The relative order of the elements should be kept the same.

Since it is impossible to change the length of the array in some languages, you must instead have the result be placed in the first part of the array nums. More formally, if there are k elements after removing the duplicates, then the first k elements of nums should hold the final result. It does not matter what you leave beyond the first k elements.

Return k after placing the final result in the first k slots of nums.

Do not allocate extra space for another array. You must do this by modifying the input array in-place with O(1) extra memory.

Implementaion

You can find the source code here.

C code

Before assembly language, it is better for us to first implement the problem in some high level language. I selected C programming language for my first attemp to implement.

The following is my implementation:

int removeDuplicates(int* nums, int numSize){
    int shift = 0;
    for(int i = 1; i < numSize; i++){
        if(nums[i] == nums[i - 1])
            shift++;
        else
            nums[i - shift] = nums[i];
    }
    return numSize - shift;
}

assembly code

# function to remove duplicated values in an array
# removeDuplicates(int *nums, int numSize)
removeDuplicates:
    add s0, zero, zero          # shift = 0
    addi s1, zero, 4            # i = 1 * 4
    add s2, a0, zero            # nums' base address
    addi s2, s2, 4              # nums[1]'s address
    add s3, a1, zero
    slli s3, s3, 2              # numSize * 4
for:
    bge s1, s3, forEnd
# if:
    addi t0, s2, -4             # t0 is (i - 1) * 4
    lw t1, 0(s2)                # t1 is nums[i]
    lw t2, 0(t0)                # t2 is nums[i - 1]
    bne t1, t2, else
# then:
    addi s0, s0, 4              # (shift++) * 4
    j ifEnd
else:
    sub t0, s2, s0              # t0 is the address of nums[i - shift]
    sw t1, 0(t0)
ifEnd:
    addi s1, s1, 4
    addi s2, s2, 4
    j for
forEnd:
    sub a1, s3, s0
    srli a1, a1, 2
    jr ra

This is the core logic of whole program. The register a0 is the parameter "nums" in C code function and a1 is "numSize".

In C code, the counter in for-loop increase by 1 for each iteration, while in my assembly implementation increase by 4. The reason is if increase by 1, 3 slli instruction need to be conducted for each iteration to access array element, which overall conducting more instruction.

After reviewing my own code, I realized that there is no need to multiply counter for 4 in for loop.

The edition of this is at here.

You can find full source code here.

Although there is no "print" function defined in C implementation, I found it is too annoying to repeat print logic for three times thus directly implemented a print function in assembly code:

# function to print each value in an array
# print(int *nums, int numSize)
print:
    add t0, zero, zero          # i = 0
    add t1, a0, zero            # t1 is array's base address
    slli t2, a1, 2              # t2 is cnt * 4, multiplied because of the performance issue
                                # we will increase t0 by 4, so it is necessary to multiply the value of cnt by 4
printLoop:
    bge t0, t2, printLoopEnd
    lw a0, 0(t1)
    addi a7, zero, 1
    ecall
    
    la a0, space
    lw a0, 0(a0)                # load the ascii value of ' ' in address of space
    addi a7, zero, 11
    ecall
    
    addi t0, t0, 4              # i++
    addi t1, t1, 4              # go to next element in array
    j printLoop
printLoopEnd:
    la a0, newline
    lw a0, 0(a0)
    addi a7, zero, 11
    ecall
    jr ra

An intresting thing is that it is hard to find offical document of how to pass argument when using "ecall". Only on some educational simulators such as Ripes or venus can I found it and both of them pass argument in difrrent way.

RISC-V's official manual said that the calling convention might differ in diffrent environment.

The ECALL instruction is used to make a service request to the execution environment. The EEI will define how parameters for the service request are passed, but usually these will be in defined locations in the integer register file.

And I failed to find the ABI used in ubuntu system, so sad :(

An explanation of ABI
ABI for ecall

Edition of for loop

I realize that although the addresses increase by 4 each iteration, there is no need to increase counter by 4.

The following is an edition of new for loop.

# function to remove duplicated values in an array
# removeDuplicates(int *nums, int numSize)
removeDuplicates:
    add s0, zero, zero          # shift = 0
    addi s1, zero, 1
    add s2, a0, zero            # nums' base address
    addi s2, s2, 4              # nums[1]'s address
for:
    bge s1, a1, forEnd
# if:
    addi t0, s2, -4             # t0 is (i - 1) * 4
    lw t1, 0(s2)                # t1 is nums[i]
    lw t2, 0(t0)                # t2 is nums[i - 1]
    bne t1, t2, else
# then:
    addi s0, s0, 4              # (shift++) * 4
    j ifEnd
else:
    sub t0, s2, s0              # t0 is the address of nums[i - shift]
    sw t1, 0(t0)
ifEnd:
    addi s1, s1, 1
    addi s2, s2, 4
    j for
forEnd:
    srli s0, s0, 2
    sub a1, a1, s0
    jr ra

Image Not Showing Possible Reasons

The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported

Learn More →

Can you use fewer instructions by adjusting the order of branches?

Image Not Showing Possible Reasons

The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported

Learn More →

jserv

Edition 2 of for loop

I tried to implement for loop with less instruction and the following is my second edition of new for loop. Directly removed jump instruction in my code.

# function to remove duplicated values in an array
# removeDuplicates(int *nums, int numSize)
removeDuplicates:
    add s0, zero, zero          # shift = 0
    addi s1, zero, 1
    add s2, a0, zero            # nums' base address
    addi s2, s2, 4              # nums[1]'s address
for:
    bge s1, a1, forEnd
# if:
    addi t0, s2, -4             # t0 is (i - 1) * 4
    lw t1, 0(s2)                # t1 is nums[i]
    lw t2, 0(t0)                # t2 is nums[i - 1]
    bne t1, t2, else
# then:
    addi s0, s0, 4              # (shift++) * 4
#    j ifEnd
else:
    sub t0, s2, s0              # t0 is the address of nums[i - shift]
    sw t1, 0(t0)
ifEnd:
    addi s1, s1, 1
    addi s2, s2, 4
    j for
forEnd:
    srli s0, s0, 2
    sub a1, a1, s0
    jr ra

Auto judging

This is the auto judging code in my implementation. If the answer is same as inline answer, this function will print "true!" to the console otherwise print "false!".

# function to check the answer
# a0 array's base address
# a1 returned array's length
# a2 true length
# a3 true array's address
checkAns:
    add t0, a0, zero            # t0 is array's base address
    la a0, false                # default is false, if true, a0 will be modified in checkTrue
    li a7, 4
    bne a1, a2, checkAnsEnd
    add t1, zero, zero            # i = 0
checkAnsFor:
    bge t1, a2, checkTrue
    lw t2, 0(t0)                # t2 = a0[i]
    lw t3, 0(a3)                # t3 = a3[i]
    bne t2, t3 checkAnsEnd
    addi t1, t1, 1
    addi t0, t0, 4
    addi a3, a3, 4
    j checkAnsFor
checkTrue:
    la a0, true
checkAnsEnd:
    ecall
    jr ra

Analysis

IF

In IF stage, CPU simply fetch instruction form instruction memory according to program counter (PC).

In this picture, current PC value is zero and the corresponding value at that address is 0x10000517.

memory:
machine code:

ID

In ID stage, CPU will decode the instruction fetched by previous stage. The value of instruction is 0x10000517, which can be decoded into "auipc x10 0x10000". And decoder will based on this result select two register to become input for next stage.

The immediate value 0x10000 will shift left by 12 bits to fit in upper 20 bits, so it becomes 0x10000000 after decode.

EX

In EXE stage, the instruction will be executed. There are four multiplexer to decide input value of ALU. And the control of these four multiplexer is done by hazard detection unit and decoded information from previous stage.

values in x8 and x18

MEM

In MEM stage, instruction that modify main memory such as load word and save word will be done.

After checking the behaviour of Ripes, I think in this stage the processor only access the address and value, if the write(save) target is memory, the actual write timing will be done in next clock cycle. By contrast, if the target is register, actuall timing would be done in the next cycle of WB.

please inform me if I was wrong, thanks!

message by maroma samsa

In this kind of instruction, the target memory address is the computation result of last stage.

There is a control signal to decide whether the value stored in memory could be changed. If the instruction is load type instruction, this control signal will be triggered.

Modified

I saw the comment from maroma samsa and found that I write somthing wrong.
The following is the revision.

In MEM stage, instruction which modified memory's values such as sw, sh and sb will be done. Although in ripes simulator the memory value seems to be changed after this stage, I think it is because main memory is the slowest component in CPU so the value change in the last moment.

I think it is impossible for CPU to store value into memory after MEM stage because we only access one component in each cycle. The save instruction must done before the last moment of MEM stage.

And instruction fetching memory's value will get the value and pass to next pipelining stage, waiting for write back to register file.

WB

In WB stage, the result get from EXE stage or MEM stage should be writed back to register file.

There is a multiplexer to decide what value should be writed back. And a write enable signal will decide whether the value stored in register could be writed or not.

the result of x18 - x8 = 0x1000000c should be writed into x5

Hazard

When executing instruction, some time we might encounter with hazard, which is a phenomena only occur on pipeline CPU.

When insturction is done one after another, there is no nay problem, but when pipelined, a instruction might use result another instruction as an input, which not yet be executed entirely. So the result will become unpredictable.

There are some way to deal with this problem, including forwarding or insert some nop (no operation) or flush.

Read-after-write hazard

A raw (read-after-write) hazard means a subsequent instruction read value from a register which sould be written by the calculation result of previous instruction, but because of pipelineing issue haven't. We use a skill known as forwarding to solve this problem. That is, directely read result from previous instruction rather than register file.

The value stored in x10 is 0x00000000, while we expect 0x10000000. So we use forwarding to directly get ther result of previous instruction "auipc x10 0x10000"

Load-use hazard

Load-use hazard means a subsequent instruction read value from a register which sould be written by the memory value read from previous instruction, but because of pipelineing issue haven't. We can't use a simple forwarding to solve this problem because ID stage is 2 stage away from MEM stage. Thus, we will insert a nop between them to solve the hazard.

Control hazard

Control hazard means when a branch or jump instruction decide to go to another address in instruction memory, CUP had bring some instruction must be discarded into pipeline.

Because we decide whether to branch in EXE stage, all the instruction in prior stages should be discared.

We use flush to discard instruction in pipline.

Step separating register

The step separating register is a register between each pipline stage, storing value passed by last stage and wait for next clock cycle. In this way, pipline system can be synchronized by two control signal in this picture.

First is enable signal. If triggered, separating register will pass stored value to the next stage.

Next is clear signal. If triggered, the value currently stored in separating register will be flushed to zero.

We can use this two signal to insert nop (stall) or change some instruction into nop (flush).

Stall

Flush

To do & Compare with MIPS32

I also find somting weired when analysing machine code, but haven't figured out why. Mabey they can be solved in later lectures or I should use some time to figure them out later.

Because I have learned MIPS32 in undergraduate school, I will compare this two ISA in following paragraph.

Half-word addressing mode ??!!

By observing instruction set and CPU architecture, I think RV32 can access instruction without alignment. This is very incrediable for me and I don't know whether this is my misunderstanding or not.

In this picture, we can see that PC will be added with either 2 or 4 and eventually become the next PC value, which means instruction address may not be a multiple of 4.

Also, both of conditional branch and unconditional jump only shift offset by one. This means RV32 may can use half-word addressing mode.

But I also found some document say that RV32 can only use word addressing.

All of RV32's instruction are start with "11"

I found that every instruction has "11" in their lowwest two bits [1:0] in RV32 instrction set.

(source: RV32 manual)
I have'nt check whether all of the instrction are same, but at first glance, along with some extension stuff written on manual, they indeed have same bits in lowwest two bits.

I think this is quite weird because it means RV32 ISA abandon the capability to represent lager immediate number and I don't know the reason of this design.

Unconditional jump is a pseudo instruction

In MIPS32, an unconditional jump has 26 bits to describe the destination's word address, so it can goto anywhere inside the valid code section by concating with PC upper address (PC[31:28]).

(img source)

While in RV32 instruction set, we implement unconditional jump with jal, which can only provide 20 bits to describe destination's byte address. It sacrifice the range of jump to reduce the size of instruction set.

And we can also realise the offset of RV32 is out of order in this picture.

(source: RV32 manual)

	MIPS32	RV32
how many bits for addressing?	26bits	20bits
range	$2^{28} = 256 M b$	$2^{21} = 2 M b$

The figure shows that the range of unconditional jump in MIPS is 128 times higher than RV32.

If we want to have a larger jump range in RV32, we can implement by combining AUIPC and JALR to get a 32-bits PC-relative range.

Although the range seems larger than MIPS', 28-bits range is enough for MIPS because the code section never exceed this range in MIPS32, at least I was taught in this way in university course. So I think in perspective of unconditional jump, MIPS32 is more effcient than RV32.

TODO:

I have learn that MIPS32 will execute on the range from {PC[31:28], 28'b0} to {PC[31:28], 28'b1}.

Does MIPS32 really never exceed this range?