Assignment1: RISC-V Assembly and Instruction Pipeline

contributed by scones525

Indexing of hierachical data structures by CLZ

In a complete binary tree, there is a certain relationship between the number of nodes and the depth.
For example, a tree with a depth of 2 has 3 to 4 nodes, and a tree with a depth of 3 has 7 to 8 nodes, and so on.

Example

Assuming we use a 64-bit binary representation for identifying each node, the number of leading zeros in the binary form of this integer can represent the level or depth of the node.

Root node (depth 1) ID is '000…0001'(63 leading zeros).
Child of the root node ID are '000…0010' and '000…0011'(62 leading zeros).

We can notice that the relationship between depth and the number of leading zero is : 64 - the number of leading zeros.

Therefore, if we know the node's ID, we can quickly calculate its depth in a complete binary tree using the count_leading_zeros function.

locate a tree node in a complete binary tree

For example if we have a complete binary tree below:

How can we locate tree node "7" by clz?

Firstly, the clz of that tree node represents its depth.
Secondly, we can noticed that the leftmost node of each depth is exactly a multiple of 2.
We can use this characteristic to calculate its position from left to right.

For example, the binary representation of the number 7 is 111.
In a 32-bit system, we can quickly calculate its depth using clz(clz of 7 is 25, its depth is 32 - 25 = 5).
Then, subtracting the starting position of that level from each number gives its position(7-4=3, so 3 is the position of 7 in that level).

Here are some tests…

You can check the code in complete_binary_tree_location.s

Implementation

This part include c code and translate it into assembly code.
Besides, program's functionality and explanations will be discussed in this subsection.

C code




















#include <stdint.h>
uint16_t count_leading_zeros(uint64_t x)
{
    x |= (x >> 1);
    x |= (x >> 2);
    x |= (x >> 4);
    x |= (x >> 8);
    x |= (x >> 16);
    x |= (x >> 32);

    /* count ones (population count) */
    x -= ((x >> 1) & 0x5555555555555555 );
    x = ((x >> 2) & 0x3333333333333333) + (x & 0x3333333333333333);
    x = ((x >> 4) + x) & 0x0f0f0f0f0f0f0f0f;
    x += (x >> 8);
    x += (x >> 16);
    x += (x >> 32);

    return (64 - (x & 0x7f));
}

Assembly code




















































































































































































.data
# num1~num3 are test data
num1: .word 0xf1ac,0x123
num2: .word 0x1ac,0x12123489
num3: .word 0x1,0x32498
and1: .word 0x33333333,0x33333333
and2: .word 0x0f0f0f0f,0x0f0f0f0f
and3: .word 0x7f
and4: .word 0x55555555, 0x55555555
and5: .word 0x33333333, 0x33333333
newline: .string "\n"
.text
main:
    la a0, num1
    jal ra, counting_zero
    jal ra, count_ones
    jal ra, print_result

    la a0, num2
    jal ra, counting_zero
    jal ra, count_ones
    jal ra, print_result    
    
    la a0, num3
    jal ra, counting_zero
    jal ra, count_ones
    jal ra, print_result    
    j exit_program

print_result:
    mv a0, t0
    li a7, 1
    ecall
    
    la a0, newline
    li a7, 4
    ecall
    ret
exit_program:
    li a7, 10
    ecall
counting_zero:
    mv t0, a0

    j shift_right_or_equal      
    jr ra   
shift_right_or_equal:     
    #now data address in a0 
    #and do shift right and or equal 5 times.
    li s0, 1
    li s1, 1
    li s2, 6
    #high 32bit in s3, low 32bit in s4
    lw s3, 0(t0)
    lw s4, 4(t0)
Loop:    #s0 in this loop are : 1,2,4,8,16,32
         #t0 : data address
    addi s1, s1, 1
    li s6, 32
    sub s6, s6, s0 # s6 32-n bits, how many bits that mask need to shift
                   #  31,30,28,24,16,0
    
    sll s7, s3, s6 #upper 32-n bit need to and w/ lower 32 bit
    #shift s3, s4
    srl s8, s3, s0
    srl s9, s4, s0
    #add upper32bits shift's bit into lower 32bits
    or s9, s9, s7
    
    or s3, s3, s8
    or s4, s4, s9 
    slli s0, s0, 1
    ble s1, s2, Loop
    
    mv t0, s3
    mv t1, s4
    ret

count_ones:
    mv s0, t0
    mv s1, t1

    #x -= ((x >> 1) & 0x5555555555555555 );
    slli s7, s0, 31      # s7 store bit that shift to lower 32bits
    srli s8, s0, 1       # x >> 1
    srli s9, s1, 1
    or s9, s9, s7   
    la t2, and4
    lw t3, 0(t2)         #0x55555555
    lw t4, 4(t2)         #0x55555555
    
    and s8, s8, t3       #  x and x05555555555555555
    and s9, s9, t4
    # do sub here s0,s1 - s8,s9
    sub s1, s1, s9
    sltu a3, s1, s9
    sub s0, s0, s8
    sub s0, s0, a3

    #x = ((x >> 2) & 0x3333333333333333) + (x &0x3333333333333333);
    #              s2                                 s3
    
    # x >> 2
    slli s7, s0, 30
    srli s8, s0, 2
    srli s9, s1, 2
    or s9, s9, s7
    # load 0x3333333333333333
    la t2, and5
    lw t3, 0(t2)
    lw t4, 4(t2)
    # and with 0x3333333333333333
    and s8, s8, t3
    and s9, s9, t4
    # x &0x3333333333333333 store in s2, s3
    and s2, s0, t3
    and s3, s1, t4
    # add together they are s8,s9 and s2,s3 respectively
    add s1, s9, s3
    sltu a3, s1, s3
    add s0, s8, s2
    add s0, s0, a3

    #x = ((x >> 4) + x) & 0x0f0f0f0f0f0f0f0f;
    
        # x >> 4
    slli s7, s0, 28
    srli s8, s0, 4
    srli s9, s1, 4
    or s9, s9, s7
    
        # (x>>4) + x store in s8,s9
    add s9, s9, s1
    sltu a3, s9, s1
    add a4, s8, s0
    add s8, a4, a3

        # & 0x0f0f0f0f0f0f0f0f
    la t2, and2 
    lw t3, 0(t2)
    lw t4, 4(t2)
    and s0, s8, t3
    and s1, s9, t4

    # x += (x >> 8);
    slli s7, s0, 24
    srli s8, s0, 8
    srli s9, s1, 8
    or s9, s9, s7

    add s1, s9, s1
    sltu a3, s1, s9
    add s0, s0, s8
    add s0, s0, a3

    # x += (x >> 16);
    slli s7, s0, 16
    srli s8, s0, 16
    srli s9, s1, 16
    or s9, s9, s7

    add s1, s9, s1
    sltu a3, s1, s9
    add s0, s0, s8
    add s0, s0, a3    
    

    # x += (x >> 32);
    mv s8, x0
    mv s9, s0

    add s1, s9, s1
    sltu a3, s1, s9
    add s0, s0, s8
    add s0, s0, a3
    
    li t0,64
    andi t1, s1, 0x7f
    sub t0, t0, t1
    ret

Explanation1: Data struct

In RV32i, register can only store 32-bit data.

However, uint_64_t is a long long integer data type which is guaranteed to be exactly 8 bytes in size.

So I store a 64 bits data in two registers in RV32i.
For example we can check lines 3 to 5 in Assembly code part above:

num1: .word 0xf1ac,0x123
num2: .word 0x1ac,0x12123489
num3: .word 0x1,0x32498

Represented in 64 bits:

num1 = 0x0000f1ac00000123
num2 = 0x000001ac12123489
num3 = 0x0000000100032498

Explanation2: Data shift

Similar to the previous question, there may be issues when shifting data.
Therefore, the bits shifted out from the high-order register must be stored and loaded into the low-order register.
For example:







# t0, t1 are two register which represent a 64bits
# number we want to shift.
# And in this case, we want to shift 5 bit.
slli t2, t0, 27    #those bits that shift out from high 32bits register
srli t0, t0, 5     #shift high 32bits register
srli t1, t1, 5     #shift low  32bits register
or t1, t1, t2      #Combine the lowest 5 bits of the high-order register with the 27 bits of the low-order register.

Analysis

The code was tested using the Ripes simulator.

And in this part, I will explain an example for each stage in Instruction pipeline( including IF, ID, IE, MEM, and WB.)

Here are the line numbers used for the CLZ function of the disassenbled code.

00000080 <count_ones>:
    80:        00028413        addi x8 x5 0
    84:        00030493        addi x9 x6 0
    88:        01f41b93        slli x23 x8 31
    8c:        00145c13        srli x24 x8 1
    90:        0014dc93        srli x25 x9 1
    94:        017cecb3        or x25 x25 x23
    98:        10000397        auipc x7 0x10000
    9c:        f8438393        addi x7 x7 -124
    a0:        0003ae03        lw x28 0 x7
    a4:        0043ae83        lw x29 4 x7
    a8:        01cc7c33        and x24 x24 x28
    ac:        01dcfcb3        and x25 x25 x29

This loop is the lower section of CLZ function.
And the instruction slli x23 x8 31 will be taken as an example.

Integer Register-Immediate Instructions.

Before we begin, let us check the The RISC-V Instruction Set Manual carefully.

Shifts by a constant are encoded as a specialization of the I-type format.The operand to be shifted
is in rs1, and the shift amount is encoded in the lower 5 bits of the I-immediate field.SLLI is a logical left shift (zeros are shiftedinto the lower bits);

IF stage

The address of instruction slli x23 x8 31 is 0x00000088. So it input 0x00000088 into the Instruction memory.
By the way, 0x0000008c refer to the next instruction address which is add by 4 from its above instruction pipeline.

You may notice that this instruction after passing through the instruction memory becomes 0x1f41b93.
But why?

As mention before in "Integer Register-Immediate Instructions" part, the "slli" instruction is divided into some parts…

imm[11:0] | rs1 | funct3 | rd | opcode

imm[11:0] is a immediate number, so 31 is represented as 'b0000000011111' here.
rs1 means what register will be shift and x8 is 'b01000'.
func3 is used to differentiate instructions when multiple instructions share the same opcode. The code of slli func3 is 'b001'
rd means what register will be written back and x23 is 'b10111'.
opcode : slli is a I-type instruction, and they share the same opcode '0b0010011'.

To summarize, 0x01f41b93 will be fed into the IF/ID stage.

ID stage

After decoding, x8 and x31 are put into Register unit. Register Unit will read what value in there register, we can check them in 'GPR'.
x8 :

x31:
Immediate value '31' will be sent into Imm. unit.
On the same time, processor need to store write back register information to ID/EX.

EX stage

We can divide this part into two sections, ALU and write back information part.

For ALU op1, it stores register x8 value. In my risc-v code, I consider it as two stages:
This part can use loop to simplify it.

    x |= (x >> 1);
    x |= (x >> 2);
    x |= (x >> 4);
    x |= (x >> 8);
    x |= (x >> 16);
    x |= (x >> 32);

However, this part cannot not be done using a loop.

    x -= ((x >> 1) & 0x5555555555555555 );
    x = ((x >> 2) & 0x3333333333333333) + (x & 0x3333333333333333);
    x = ((x >> 4) + x) & 0x0f0f0f0f0f0f0f0f;
    x += (x >> 8);
    x += (x >> 16);
    x += (x >> 32);

So, instruction slli x23 x8 31 is the first instruction of c code x -= ((x >> 1) & 0x5555555555555555 );
We can learn that 0x000000ff is the value after the loop dealing with it.

For op2, it stores immdeiately value '31' here which means how many bits x8 needs to shift.
After calaulating, we can conduct Res '0x80000000' to the next stage.

MEM stage

We can find an insteresting thing that is : Data Memory output is '0x00000000',
which means nothing is done here.
This is because slli is an integer logical instruction, the entire operation takes place entirely within the CPU's register file and the Arithmetic Logic Unit (ALU), with no involvement of the data memory.

WB stage

Here, we can discuss two values.
One is '0x80000000', the value that need to write back to 'x31' register.
Another is '0x17', as we mention in decode part before it shares the same means with 'x23' register.

In the same cycle, notice that thay have written back to Register Unit.
Wr data:

Wr idx:

Optimization

In the previous 64 bits version, we have used a lot of shift and or to implement how to count 64bits number zero.

However, counting a 64 bits number and counting a 32 bits number are actually the same thing.
Why?
Because it doesn't matter which register you count, all we need to do is count the leading zeros in the upper 32 bits or the lower 32 bits.

For example, if we have a test case:

# problem:0x0000000000000001 how many leading bits?
# save it into  two register s0, s1 respectively
# s0 store 0x00000000
# s1 store 0x00000001
res = 0;
if (s0 == 0){
    res += 32;
    res += clz(s1);
}
else {
    res += clz(s0);
}

If we use this method, it can reduce a significant amount of execution time and decrease code size theoretically.

You can check this file from this latest_hw1.s.

Before make this optimization:

After:

The biggest difference between the two can be divided into two points.

Former requires calculations on two registers(because two 32bits register to represent a 64bits number).
Additional bit shift calculations(There two 32bits register can not calculate directly, we must apply some method to implement it).