Try   HackMD

Assignment1: RISC-V Assembly and Instruction Pipeline

contributed by scones525

Indexing of hierachical data structures by CLZ

In a complete binary tree, there is a certain relationship between the number of nodes and the depth.
For example, a tree with a depth of 2 has 3 to 4 nodes, and a tree with a depth of 3 has 7 to 8 nodes, and so on.

Example

Assuming we use a 64-bit binary representation for identifying each node, the number of leading zeros in the binary form of this integer can represent the level or depth of the node.

  • Root node (depth 1) ID is '0000001'(63 leading zeros).
  • Child of the root node ID are '0000010' and '0000011'(62 leading zeros).

We can notice that the relationship between depth and the number of leading zero is : 64 - the number of leading zeros.

Therefore, if we know the node's ID, we can quickly calculate its depth in a complete binary tree using the count_leading_zeros function.

locate a tree node in a complete binary tree

For example if we have a complete binary tree below:

     1
    / \
   2   3
  / \ / \
 4  5 6  7

How can we locate tree node "7" by clz?

Firstly, the clz of that tree node represents its depth.
Secondly, we can noticed that the leftmost node of each depth is exactly a multiple of 2.
We can use this characteristic to calculate its position from left to right.

For example, the binary representation of the number 7 is 111.
In a 32-bit system, we can quickly calculate its depth using clz(clz of 7 is 25, its depth is 32 - 25 = 5).
Then, subtracting the starting position of that level from each number gives its position(7-4=3, so 3 is the position of 7 in that level).

Here are some tests

Image Not Showing Possible Reasons
  • The image was uploaded to a note which you don't have access to
  • The note which the image was originally uploaded to has been deleted
Learn More →

You can check the code in complete_binary_tree_location.s

Implementation

This part include c code and translate it into assembly code.
Besides, program's functionality and explanations will be discussed in this subsection.

C code

#include <stdint.h> uint16_t count_leading_zeros(uint64_t x) { x |= (x >> 1); x |= (x >> 2); x |= (x >> 4); x |= (x >> 8); x |= (x >> 16); x |= (x >> 32); /* count ones (population count) */ x -= ((x >> 1) & 0x5555555555555555 ); x = ((x >> 2) & 0x3333333333333333) + (x & 0x3333333333333333); x = ((x >> 4) + x) & 0x0f0f0f0f0f0f0f0f; x += (x >> 8); x += (x >> 16); x += (x >> 32); return (64 - (x & 0x7f)); }

Assembly code

.data # num1~num3 are test data num1: .word 0xf1ac,0x123 num2: .word 0x1ac,0x12123489 num3: .word 0x1,0x32498 and1: .word 0x33333333,0x33333333 and2: .word 0x0f0f0f0f,0x0f0f0f0f and3: .word 0x7f and4: .word 0x55555555, 0x55555555 and5: .word 0x33333333, 0x33333333 newline: .string "\n" .text main: la a0, num1 jal ra, counting_zero jal ra, count_ones jal ra, print_result la a0, num2 jal ra, counting_zero jal ra, count_ones jal ra, print_result la a0, num3 jal ra, counting_zero jal ra, count_ones jal ra, print_result j exit_program print_result: mv a0, t0 li a7, 1 ecall la a0, newline li a7, 4 ecall ret exit_program: li a7, 10 ecall counting_zero: mv t0, a0 j shift_right_or_equal jr ra shift_right_or_equal: #now data address in a0 #and do shift right and or equal 5 times. li s0, 1 li s1, 1 li s2, 6 #high 32bit in s3, low 32bit in s4 lw s3, 0(t0) lw s4, 4(t0) Loop: #s0 in this loop are : 1,2,4,8,16,32 #t0 : data address addi s1, s1, 1 li s6, 32 sub s6, s6, s0 # s6 32-n bits, how many bits that mask need to shift # 31,30,28,24,16,0 sll s7, s3, s6 #upper 32-n bit need to and w/ lower 32 bit #shift s3, s4 srl s8, s3, s0 srl s9, s4, s0 #add upper32bits shift's bit into lower 32bits or s9, s9, s7 or s3, s3, s8 or s4, s4, s9 slli s0, s0, 1 ble s1, s2, Loop mv t0, s3 mv t1, s4 ret count_ones: mv s0, t0 mv s1, t1 #x -= ((x >> 1) & 0x5555555555555555 ); slli s7, s0, 31 # s7 store bit that shift to lower 32bits srli s8, s0, 1 # x >> 1 srli s9, s1, 1 or s9, s9, s7 la t2, and4 lw t3, 0(t2) #0x55555555 lw t4, 4(t2) #0x55555555 and s8, s8, t3 # x and x05555555555555555 and s9, s9, t4 # do sub here s0,s1 - s8,s9 sub s1, s1, s9 sltu a3, s1, s9 sub s0, s0, s8 sub s0, s0, a3 #x = ((x >> 2) & 0x3333333333333333) + (x &0x3333333333333333); # s2 s3 # x >> 2 slli s7, s0, 30 srli s8, s0, 2 srli s9, s1, 2 or s9, s9, s7 # load 0x3333333333333333 la t2, and5 lw t3, 0(t2) lw t4, 4(t2) # and with 0x3333333333333333 and s8, s8, t3 and s9, s9, t4 # x &0x3333333333333333 store in s2, s3 and s2, s0, t3 and s3, s1, t4 # add together they are s8,s9 and s2,s3 respectively add s1, s9, s3 sltu a3, s1, s3 add s0, s8, s2 add s0, s0, a3 #x = ((x >> 4) + x) & 0x0f0f0f0f0f0f0f0f; # x >> 4 slli s7, s0, 28 srli s8, s0, 4 srli s9, s1, 4 or s9, s9, s7 # (x>>4) + x store in s8,s9 add s9, s9, s1 sltu a3, s9, s1 add a4, s8, s0 add s8, a4, a3 # & 0x0f0f0f0f0f0f0f0f la t2, and2 lw t3, 0(t2) lw t4, 4(t2) and s0, s8, t3 and s1, s9, t4 # x += (x >> 8); slli s7, s0, 24 srli s8, s0, 8 srli s9, s1, 8 or s9, s9, s7 add s1, s9, s1 sltu a3, s1, s9 add s0, s0, s8 add s0, s0, a3 # x += (x >> 16); slli s7, s0, 16 srli s8, s0, 16 srli s9, s1, 16 or s9, s9, s7 add s1, s9, s1 sltu a3, s1, s9 add s0, s0, s8 add s0, s0, a3 # x += (x >> 32); mv s8, x0 mv s9, s0 add s1, s9, s1 sltu a3, s1, s9 add s0, s0, s8 add s0, s0, a3 li t0,64 andi t1, s1, 0x7f sub t0, t0, t1 ret

Explanation1: Data struct

In RV32i, register can only store 32-bit data.

However, uint_64_t is a long long integer data type which is guaranteed to be exactly 8 bytes in size.

So I store a 64 bits data in two registers in RV32i.
For example we can check lines 3 to 5 in Assembly code part above:

num1: .word 0xf1ac,0x123
num2: .word 0x1ac,0x12123489
num3: .word 0x1,0x32498

Represented in 64 bits:

num1 = 0x0000f1ac00000123
num2 = 0x000001ac12123489
num3 = 0x0000000100032498

Explanation2: Data shift

Similar to the previous question, there may be issues when shifting data.
Therefore, the bits shifted out from the high-order register must be stored and loaded into the low-order register.
For example:

# t0, t1 are two register which represent a 64bits # number we want to shift. # And in this case, we want to shift 5 bit. slli t2, t0, 27 #those bits that shift out from high 32bits register srli t0, t0, 5 #shift high 32bits register srli t1, t1, 5 #shift low 32bits register or t1, t1, t2 #Combine the lowest 5 bits of the high-order register with the 27 bits of the low-order register.

Analysis

The code was tested using the Ripes simulator.

And in this part, I will explain an example for each stage in Instruction pipeline( including IF, ID, IE, MEM, and WB.)

Here are the line numbers used for the CLZ function of the disassenbled code.

00000080 <count_ones>:
    80:        00028413        addi x8 x5 0
    84:        00030493        addi x9 x6 0
    88:        01f41b93        slli x23 x8 31
    8c:        00145c13        srli x24 x8 1
    90:        0014dc93        srli x25 x9 1
    94:        017cecb3        or x25 x25 x23
    98:        10000397        auipc x7 0x10000
    9c:        f8438393        addi x7 x7 -124
    a0:        0003ae03        lw x28 0 x7
    a4:        0043ae83        lw x29 4 x7
    a8:        01cc7c33        and x24 x24 x28
    ac:        01dcfcb3        and x25 x25 x29

This loop is the lower section of CLZ function.
And the instruction slli x23 x8 31 will be taken as an example.

Integer Register-Immediate Instructions.

Before we begin, let us check the The RISC-V Instruction Set Manual carefully.

Shifts by a constant are encoded as a specialization of the I-type format.The operand to be shifted
is in rs1, and the shift amount is encoded in the lower 5 bits of the I-immediate field.SLLI is a logical left shift (zeros are shiftedinto the lower bits);

IF stage


The address of instruction slli x23 x8 31 is 0x00000088. So it input 0x00000088 into the Instruction memory.
By the way, 0x0000008c refer to the next instruction address which is add by 4 from its above instruction pipeline.

You may notice that this instruction after passing through the instruction memory becomes 0x1f41b93.
But why?

As mention before in "Integer Register-Immediate Instructions" part, the "slli" instruction is divided into some parts

imm[11:0] | rs1 | funct3 | rd | opcode
  • imm[11:0] is a immediate number, so 31 is represented as 'b0000000011111' here.
  • rs1 means what register will be shift and x8 is 'b01000'.
  • func3 is used to differentiate instructions when multiple instructions share the same opcode. The code of slli func3 is 'b001'
  • rd means what register will be written back and x23 is 'b10111'.
  • opcode : slli is a I-type instruction, and they share the same opcode '0b0010011'.

To summarize, 0x01f41b93 will be fed into the IF/ID stage.

ID stage

  • After decoding, x8 and x31 are put into Register unit. Register Unit will read what value in there register, we can check them in 'GPR'.
    x8 :


    x31:

  • Immediate value '31' will be sent into Imm. unit.

  • On the same time, processor need to store write back register information to ID/EX.

EX stage

We can divide this part into two sections, ALU and write back information part.

For ALU op1, it stores register x8 value. In my risc-v code, I consider it as two stages:
This part can use loop to simplify it.

    x |= (x >> 1);
    x |= (x >> 2);
    x |= (x >> 4);
    x |= (x >> 8);
    x |= (x >> 16);
    x |= (x >> 32);

However, this part cannot not be done using a loop.

    x -= ((x >> 1) & 0x5555555555555555 );
    x = ((x >> 2) & 0x3333333333333333) + (x & 0x3333333333333333);
    x = ((x >> 4) + x) & 0x0f0f0f0f0f0f0f0f;
    x += (x >> 8);
    x += (x >> 16);
    x += (x >> 32);

So, instruction slli x23 x8 31 is the first instruction of c code x -= ((x >> 1) & 0x5555555555555555 );
We can learn that 0x000000ff is the value after the loop dealing with it.

For op2, it stores immdeiately value '31' here which means how many bits x8 needs to shift.
After calaulating, we can conduct Res '0x80000000' to the next stage.

MEM stage

We can find an insteresting thing that is : Data Memory output is '0x00000000',
which means nothing is done here.
This is because slli is an integer logical instruction, the entire operation takes place entirely within the CPU's register file and the Arithmetic Logic Unit (ALU), with no involvement of the data memory.

WB stage


Here, we can discuss two values.
One is '0x80000000', the value that need to write back to 'x31' register.
Another is '0x17', as we mention in decode part before it shares the same means with 'x23' register.

In the same cycle, notice that thay have written back to Register Unit.
Wr data:


Wr idx:

Optimization

In the previous 64 bits version, we have used a lot of shift and or to implement how to count 64bits number zero.

However, counting a 64 bits number and counting a 32 bits number are actually the same thing.
Why?
Because it doesn't matter which register you count, all we need to do is count the leading zeros in the upper 32 bits or the lower 32 bits.

For example, if we have a test case:

# problem:0x0000000000000001 how many leading bits?
# save it into  two register s0, s1 respectively
# s0 store 0x00000000
# s1 store 0x00000001
res = 0;
if (s0 == 0){
    res += 32;
    res += clz(s1);
}
else {
    res += clz(s0);
}

If we use this method, it can reduce a significant amount of execution time and decrease code size theoretically.

You can check this file from this latest_hw1.s.

Before make this optimization:


After:

The biggest difference between the two can be divided into two points.

  1. Former requires calculations on two registers(because two 32bits register to represent a 64bits number).
  2. Additional bit shift calculations(There two 32bits register can not calculate directly, we must apply some method to implement it).