Try   HackMD

Assignment1: RISC-V Assembly and Instruction Pipeline

contributed by < ChengChiTing >

tags: RISC-V, jserv

Implement Collatz conjecture by ctz

Collatz conjecture

The Collatz conjecture is one of the most famous unsolved problems in mathematics. It was posed by L. Collatz in 1937, also called the "3x+1 mapping", "3n+1 problem", "Hasse's algorithm", "Kakutani's problem", "Syracuse algorithm", "Syracuse problem", "Thwaites conjecture", and "Ulam's problem".

The conjecture asks whether repeating two simple arithmetic operations will eventually transform every "positive integer" into "1". It concerns sequences of integers in which each term is obtained from the previous term as follows:

(1) if the previous term is even, the next term is one half of the previous term (N = N/2).
(2) If the previous term is odd, the next term is 3 times the previous term plus 1 (N = 3N +1).

Example: if we choose number "7".

7 -> 22 -> 11 -> 34 -> 17 -> 52 -> 26 -> 13 -> 40 -> 20 -> 10 -> 5 -> 16 -> 8 -> 4 -> 2 -> 1
It needs 16 steps to reach "1".

The conjecture is that these sequences always reach 1, no matter which positive integer is chosen to start the sequence.

Implementation

You can find the source code here. Feel free to fork and modify it.

How to achieve goal

When we input a test number"N", the function can calcute how many steps need for N to reach "1". There are two situation have to deal with, N = 0 and N = 1.
If N = 0, the function will exit and print "The number is zero".
If N = 1, it is defined as needing zero step to reach 1.

The loop has to confirm N is even or odd, first. After that, we can use right shift to make N/2 when N is even. If N is odd, we can use left shift to make 2N. Then plus previous N and 1 to get 3N + 1. The code write in C and RISC-V Assembly Language are shown below.

C code Version 1

The code below is an implementation of Collatz conjecture.

#include <stdio.h> #include <stdint.h> #define data 7 // number N // count_trailing_zeros function uint16_t ctz(uint32_t x) { x |= (x << 1); x |= (x << 2); x |= (x << 4); x |= (x << 8); x |= (x << 16); // count ones (population count) x -= ((x >> 1) & 0x5555555555555555); x = ((x >> 2) & 0x3333333333333333) + (x & 0x3333333333333333); x = ((x >> 4) + x) & 0x0f0f0f0f0f0f0f0f; x += (x >> 8); x += (x >> 16); return (32 - (x & 0x7f)); } // count how many steps for finishing Collatz conjecture uint16_t collatz(uint32_t x){ if(x == 0){ printf("The number is zero \n"); return 0; } uint16_t result = 0; uint16_t y = 0; while(x != 1){ if((x & 0x01) == 1){ x += ((x << 1) + 1); result ++; } else{ y = ctz(x); x = x >> y; result += y; } } printf("The number of steps for Collatz conjecture is %u", result); return result; } int main (){ uint32_t x = data; x = collatz(x); return 0; }

The above C code can be separated into two parts, count_trailing_zeros and Collatz conjecture.

Count Trailing Zeros

The count_trailing_zeros function(ctz) is derived from Quiz 1 count_leading_zeros function(clz).It can calcute the trailing zero of a 32-bit integers. A trailing zero is defined as any ‘0’ digit that appears after the last non-zero digit in the binary representation of a number.

Example: Input : 6720 -> Output : 6.

Explanation: As Binary of 6720 = (00000000000000000001101001000000)
There are 6 zeros appear after the last non-zero digit

The original C code in Quiz 1 Problem A is below.

#include <stdint.h> uint16_t count_leading_zeros(uint64_t x) { x |= (x >> 1); x |= (x >> 2); x |= (x >> 4); x |= (x >> 8); x |= (x >> 16); x |= (x >> 32); // count ones (population count) x -= ((x >> 1) & 0x5555555555555555); x = ((x >> 2) & 0x3333333333333333) + (x & 0x3333333333333333); x = ((x >> 4) + x) & 0x0f0f0f0f0f0f0f0f; x += (x >> 8); x += (x >> 16); x += (x >> 16); return (64 - (x & 0x7f)); }

The above code complete two tasks. First, change all the bit on the right side of the first non-zero bit into "1", and left side of the first non-zero bit into "0". Second, implement population count to get the number of "1" in the 64-bit binary. Thus, if we subtract the number of "1" from 64, we can get the number of leading zero.

Now, we rewrite the above code and change it into 32-bit version.

x |= (x << 1); x |= (x << 2); x |= (x << 4); x |= (x << 8); x |= (x << 16);

After rewrite the code, we can change all the bit on the right side of the last non-zero bit into "1", and left side of the last non-zero bit into "0" and then we also implement population to get number of "1".Thus, we can get the number of trailing zero

Collatz conjecture function

The first thing we have to do is check input number N is zero or not. If N is not zero we can enter the loop and then we need to confirm N is even or odd.

If N is odd we left shift one bit of N and plus 1. Add result to the original N then we can get "3N + 1".

If N is even we right shift one bit of N to get "N/2".In the code above, if the factor of N is power of 2 we can use ctz to count the trailing zero and and make right shift once time.

Example: Input : 560 -> Output : 35.

Explanation: As Binary of 560 = (00000000000000000000001000110000)
After right shift = (00000000000000000000000000100011)

The counter will plus 1 if N is odd, and plus the number of trailing zero when N is even.
Finally, when N is 1, the program will exit and print how many steps need for N converting into 1.

Assembly code Version 1

.data data0: .word 0 # N = 0 data1: .word 1 # 0 steps data2: .word 7 # 16 steps data3: .word 1024 # 10 steps data4: .word 27 # 111 steps str1: .string "The number is zero" str2: .string "The number of steps for Collatz conjecture is " .text # s1 : test data # a0 : the input number of collatz function # a1 : the steps need for finishing collatz conjecture # t0 : the input of ctz function # a5 : the return value of ctzfunction # ra : return address main: lw s1, data2 mv a0, s1 jal ra, collatz # call collatz function li a7, 10 # exit program ecall collatz: addi sp, sp, -4 # push s0 sw ra, 0(sp) # store ra into s0 add a1, zero, zero # result(a1) = 0 addi a2, zero, 1 # a2 = 0x01 bne zero, a0, loop # confirm a0 is 0 or not mv t3, a0 # printf("The number is zero \n") la a0, str1 li a7, 4 ecall mv a0, t3 lw ra, 0(sp) addi sp, sp, 4 # return s0 jr ra # end function loop: beq a0, a2, end # if number = 1, end the function and a3, a0, a2 # confirm the number is even or odd bne a3, a2, even slli a3, a0, 1 # ood execute 3n + 1 addi a3, a3, 1 add a0, a3, a0 addi a1, a1, 1 j loop even: mv t0, a0 # even number execute n / 2 jal ra, ctz # call ctz function srl a0, a0, a5 # remove trailing zero add a1, a1, a5 j loop end: mv t3, a0 # printf("The number of steps for Collatz conjecture is "); la a0, str2 li a7, 4 ecall mv a0, a1 li a7, 1 ecall mv a0, t3 lw ra, 0(sp) addi sp, sp, 4 # return s0 jr ra # end function ctz: # counting trailing zero slli t1, t0, 1 # x |= (x << 1); or t0, t0, t1 slli t1, t0, 2 # x |= (x << 2); or t0, t0, t1 slli t1, t0, 4 # x |= (x << 4); or t0, t0, t1 slli t1, t0, 8 # x |= (x << 8); or t0, t0, t1 slli t1, t0, 16 # x |= (x << 16); or t0, t0, t1 # count ones (population count) li t1, 0x55555555 # 0x55555555 srli t2, t0, 1 # x >> 1 and t2, t1, t2 # (x >> 1) & 0x55555555 sub t0, t0, t2 # x -= ((x >> 1) & 0x55555555) li t1, 0x33333333 # 0x33333333 srli t2, t0, 2 # x >> 2 and t2, t1, t2 # (x >> 2) & 0x33333333 and t0, t0, t1 # x & 0x33333333 add t0, t0, t2 # x = ((x >> 2) & 0x33333333) + (x & 0x33333333) li t1, 0x0f0f0f0f # 0x0f0f0f0f srli t2, t0, 4 # x >> 4 add t0, t2, t0 # (x >> 4) + x and t0, t0, t1 # x = ((x >> 4) + x) & 0x0f0f0f0f srli t2, t0, 8 # x >> 8 add t0, t0, t2 # x += (x >> 8) srli t2, t0, 16 # x >> 16 add t0, t0, t2 # x += (x >> 16) addi t1, zero,32 andi t0, t0, 0x7f # x & 0x7f sub a5, t1, t0 jr ra

C code Version 2

Rewrite the count trailing zero function in byte shift. This method is similar to binary search. It works by repeatedly dividing in half the portion of the list that could contain the item, until We find the item we want .
We make left shift first to check the last non-zero is on the left side or reght side. Then we make shift again and check the position of the last non-zero repeatedly until we find it. The counter will counter the trailing zeros during shifting.
The program costs log2(32) = 5 times for counting trailing zero.

The C code below is ctz in byte shift version.

uint16_t ctz(uint32_t x) // count_trailing_zeros function { if (x == 0) return 32; // if x = 0, return 32 immediately int n = 1; // n is the number of trailing zeros if ((x << 16) == 0) { n += 16; x >>= 16; } if ((x << 24) == 0) { n += 8; x >>= 8; } if ((x << 28) == 0) { n += 4; x >>= 4; } if ((x << 30) == 0) { n += 2; x >>= 2; } n = n - (x & 1); return n; }

The Assembly code below is an implementation of Collatz conjecture in byte shift version.
I rewrite some part of code, and use stack to reduce the using of registers.

Assembly code Version 2

.data data0: .word 0 # N = 0 data1: .word 1 # 0 steps data2: .word 7 # 16 steps data3: .word 1024 # 10 steps data4: .word 27 # 111 steps str1: .string "The number is zero" str2: .string "The number of steps for Collatz conjecture is " .text # s0 : test data # s1 : the steps need for finishing collatz conjecture # ra : return address main: lw s0, data2 jal ra, collatz # call collatz function li a7, 10 # exit program ecall collatz: addi sp, sp, -12 sw ra, 0(sp) # store ra into 0(sp) sw s0, 4(sp) # store s0 into 4(sp) add s1, zero, zero # result(s1) = 0 addi t0, zero, 1 # t0 = 1 bne zero, s0, loop # confirm a0 is 0 or not la a0, str1 li a7, 4 ecall addi sp, sp, 12 jr ra # end function loop: beq s0, t0, end # if number = 1, end the function andi t1, s0, 1 # confirm the number is even or odd bne t1, t0, even slli t2, s0, 1 # ood execute 3n + 1 addi t2, t2, 1 add s0, t2, s0 addi s1, s1, 1 j loop even: jal ra, ctz # call ctz function srl s0, s0, t1 # remove trailing zero add s1, s1, t1 j loop end: la a0, str2 # printf("The number of steps for Collatz conjecture is ") li a7, 4 ecall mv a0, s1 li a7, 1 ecall lw ra, 0(sp) # return ra lw s0, 4(sp) # return s0 addi sp, sp, 12 # return stack jr ra # end function ctz: # counting trailing zero sw s0, 8(sp) # store s0 into 8(sp) addi t1, zero, 1 slli t2, s0, 16 bne t2, zero, bs24 addi t1, t1, 16 srli s0, s0, 16 bs24: slli t2, s0, 24 bne t2, zero, bs28 addi t1, t1, 8 srli s0, s0, 8 bs28: slli t2, s0, 28 bne t2, zero, bs30 addi t1, t1, 4 srli s0, s0, 4 bs30: slli t2, s0, 30 bne t2, zero, bs31 addi t1, t1, 2 srli s0, s0, 2 bs31: andi s0, s0, 1 sub t1, t1, s0 lw s0, 8(sp) # return s0 into 8(sp) jr ra

Analysis

We test our code using Ripes simulator.

Pseudo instruction

Put assembly code above into the editor of Ripes and we will see Ripe doesn't execute it literally. Instead, it replace pseudo instruction (p.110) into equivalent one, and change register name from ABI name to Register name.
The translated code of Version 2 by Ripes looks like below :

00000000 <main>:
     0:        10000417        auipc x8 0x10000
     4:        00842403        lw x8 8 x8
     8:        00c000ef        jal x1 12 <collatz>
     c:        00a00893        addi x17 x0 10
    10:        00000073        ecall

00000014 <collatz>:
    14:        ff410113        addi x2 x2 -12
    18:        00112023        sw x1 0 x2
    1c:        00812223        sw x8 4 x2
    20:        000004b3        add x9 x0 x0
    24:        00100293        addi x5 x0 1
    28:        00801e63        bne x0 x8 28 <loop>
    2c:        10000517        auipc x10 0x10000
    30:        fe850513        addi x10 x10 -24
    34:        00400893        addi x17 x0 4
    38:        00000073        ecall
    3c:        00c10113        addi x2 x2 12
    40:        00008067        jalr x0 x1 0

00000044 <loop>:
    44:        02540863        beq x8 x5 48 <end>
    48:        00147313        andi x6 x8 1
    4c:        00531c63        bne x6 x5 24 <even>
    50:        00141393        slli x7 x8 1
    54:        00138393        addi x7 x7 1
    58:        00838433        add x8 x7 x8
    5c:        00148493        addi x9 x9 1
    60:        fe5ff06f        jal x0 -28 <loop>

00000064 <even>:
    64:        03c000ef        jal x1 60 <ctz>
    68:        00645433        srl x8 x8 x6
    6c:        006484b3        add x9 x9 x6
    70:        fd5ff06f        jal x0 -44 <loop>

00000074 <end>:
    74:        10000517        auipc x10 0x10000
    78:        fb350513        addi x10 x10 -77
    7c:        00400893        addi x17 x0 4
    80:        00000073        ecall
    84:        00048513        addi x10 x9 0
    88:        00100893        addi x17 x0 1
    8c:        00000073        ecall
    90:        00012083        lw x1 0 x2
    94:        00412403        lw x8 4 x2
    98:        00c10113        addi x2 x2 12
    9c:        00008067        jalr x0 x1 0

000000a0 <ctz>:
    a0:        00812423        sw x8 8 x2
    a4:        00100313        addi x6 x0 1
    a8:        01041393        slli x7 x8 16
    ac:        00039663        bne x7 x0 12 <bs24>
    b0:        01030313        addi x6 x6 16
    b4:        01045413        srli x8 x8 16

000000b8 <bs24>:
    b8:        01841393        slli x7 x8 24
    bc:        00039663        bne x7 x0 12 <bs28>
    c0:        00830313        addi x6 x6 8
    c4:        00845413        srli x8 x8 8

000000c8 <bs28>:
    c8:        01c41393        slli x7 x8 28
    cc:        00039663        bne x7 x0 12 <bs30>
    d0:        00430313        addi x6 x6 4
    d4:        00445413        srli x8 x8 4

000000d8 <bs30>:
    d8:        01e41393        slli x7 x8 30
    dc:        00039663        bne x7 x0 12 <bs31>
    e0:        00230313        addi x6 x6 2
    e4:        00245413        srli x8 x8 2

000000e8 <bs31>:
    e8:        00147413        andi x8 x8 1
    ec:        40830333        sub x6 x6 x8
    f0:        00812403        lw x8 8 x2
    f4:        00008067        jalr x0 x1 0

In each row it denotes address in instruction memory, instruction's machine code (in hex) and instruction itself respectively.

Output Result

We test five different test data (0, 1, 7, 27, 1024) respectively, and the results are shown below:

Test Data Output Result
0
1
7
27
1024

7 = {7, 22, 11, 34, 17, 52, 26, 13, 40, 20, 10, 5, 16, 8, 4, 2, 1}

27 = {27, 82, 41, 124, 62, 31, 94, 47, 142, 71, 214, 107, 322, 161, 484, 242, 121, 364, 182, 91, 274, 137, 412, 206, 103, 310, 155, 466, 233, 700, 350, 175, 526, 263, 790, 395, 1186, 593, 1780, 890, 445, 1336, 668, 334, 167, 502, 251, 754, 377, 1132, 566, 283, 850, 425, 1276, 638, 319, 958, 479, 1438, 719, 2158, 1079, 3238, 1619, 4858, 2429, 7288, 3644, 1822, 911, 2734, 1367, 4102, 2051, 6154, 3077, 9232, 4616, 2308, 1154, 577, 1732, 866, 433, 1300, 650, 325, 976, 488, 244, 122, 61, 184, 92, 46, 23, 70, 35, 106, 53, 160, 80, 40, 20, 10, 5, 16, 8, 4, 2, 1}

1024 = {1024, 512, 256, 128, 64, 32, 16, 8, 4, 2, 1}

Accoding to the table above, the code can execute in Ripes and print the answer correctly. Furthermore, if we execute C code above in C compiler we can also get same output.

Assembly code optimization

Comparing to Assembly code verision 1, version 2 use byte shift to accelerate trailing zero counting. After optimization, version 2 use less cycles to finish the Collatz conjecture program.

The cycles used by version 1 and version 2 are shown below:

Test Data Version 1 version 2
0
1
7
27
1024

Accoding to the table above, we find version 2 cost less clcyes in every data, but this phenomenon isn't obvious in "0" and "1". The possible reasons may be they didn't enter the loop. Furthermore, we can notice when the number of steps increase, the difference between the two version is greater(e.g. N = 7, N = 27). If we use the power of 2 , like 1024 to run program, we can get the same result.
It can be seen that byte shift ctz can optimize Collatz conjecture function and make program be more efficient.

5-stage pipelined processor

Ripes provide four kinds of processor for us to choose, including single cycle processor, 5-stage processor, 5-stage with hazard detection and 5-stage with forward and hazard detection. Here we choose the 5 stage processor. Its block diagram look like this:

The "5-stage" means this processor using five-stage pipeline to parallelize instructions. The simplicity of operations performed allows every instruction to be completed in one processor cycle.The 5-stages are:

(1) Instruction fetch (IF):
In the Fetch stage, instruction is being fetched from the memory.

(2) Instruction decode and register fetch (ID):
During the Decode stage, we decode the instruction and fetch the source operands

(3) Execute (EX):
During the execute stage, the computer performs the operation specified by the instruction

(4) Memory access (MEM):
If there is any data that needs to be accessed, it is done in the memory stage

(5) Register write back (WB):
If we need to store the result in the destination location, it is done during the writeback stage

You can see that each stage is separated by pipeline registers ( the rectangle block with stage names on its each side IFID、IDEX、EX/MEM、MEM/WB ) in the block diagram.

Instruction in different type of format will go through 5 stages with different signal turned on. We can check the Pipeline diagram generated by Ripes below.

Let's discuss U-type and R-type format in detail with example.

U-type format

We can use the first instruction in this program auipc x8 0x10000 to demonstrate U-type instruction. According to RISC-V Manual (p.14):

AUIPC (add upper immediate to pc) is used to build pc-relative addresses and uses the U-type format. AUIPC forms a 32-bit offset from the 20-bit U-immediate, filling in the lowest 12 bits with zeros, adds this offset to the pc, then places the result in register rd.

Let's see how it go through each stage.

1. Instruction fetch (IF)

(1) We start from instruction put at 0x00000000, so input 0x00000000intoaddr

(2) The machine code of first instruction is 0x10000417 (you can look it up in the code snippet above), so instr is equal to 0x10000417.

(3) PC will increment by 4 automatically using the above adder.

  • Because there is no branch occur, next instruction will be at PC + 4, so the multiplexer before PC choose input come from adder.
2. Instruction decode and register fetch (ID)

(1) Instruction 0x10000417 input to the decode and will be decoded

(2) Decode output opcode = auipc

(3) Decode output Wr idx = 0x08

(4) AUIPC and 0x10000417 input into imm., then imm. = 0x10000000 (0x10000 in upper 20 bits, filling in the lowest 12 bits with zeros)

  • U-type format doesn't read register, so R1 idx and R2 idx are both 0x00.
  • Reg 1 and Reg 2 read value from 0x00 register, so their value are both 0x00000000 too.
3. Execute (EX)

(1) First level multiplexers choose value come from Reg 1 and Reg 2, but this is an U-type format instruction, we don't use them. So they are filtered by second level multiplexer.

(2) Reg 1 and Reg 2 are also send to branch block, but no branch is taken.

(3) Second level multiplexer choose value come from current PC value 0x00000000 and immediate 0x10000000 as Op1 and Op2 of ALU.

(4) ALU will add two operand together, so the Res is equal to 0x10000000.

4. Memory access (MEM)

(1) Res from ALU is send to Addr in Data memory

(2) Use as data memory address. Memory read data at address 0x10000000, so Read out is equal to 0x00000000. We can check the table below to confirm the data section of memory.

(3) 0x10000000 Pass through this stage and go to WB stage

(4) 0x10000000 Send back to EXE stage for next instruction to use

  • Reg 2 is send to Data in, but memory doesn't enable writing.
5. Register write back (WB)

(1) The multiplexer choose Res from ALU as final output. So the output value is 0x10000000.

  • The output value and Wr idx 0x08 are send back to registers block. Finally, the value 0x10000000 will be write into x8 register, whose ABI name is s0.

After all these stage are done, the register is updated like this:

R-type format

We can use the instruction in 0x00000058 (add x8 x7 x8)to demonstrate R-type instruction. According to RISC-V Manual (p.15):

ADD and SUB perform addition and subtraction respectively. Overflows are ignored and the low XLEN bits of results are written to the destination.

Let's see how it go through each stage.

1. Instruction fetch (IF)

(1) PC output daaress at 0x00000058, so input 0x00000058intoaddr

(2) The instruction at address 0x00000058 is 0x00838433 (you can look it up in the code snippet above), so instr is equal to 0x00838433.

(3) PC will increment by 4 automatically using the above adder.

  • Because there is no branch occur, next instruction will be at PC + 4, so the multiplexer before PC choose input come from adder.
2. Instruction decode and register fetch (ID)

(1) Instruction 0x00838433 input to the decode and will be decoded

(2) Decode output opcode = ADD

(3) Decode output Wr idx = 0x08

(4) 0x00838433 and ADD input into imm., then imm. = 0xdeadbeef

  • R-type format has to read register, so R1 idx and R2 idx are 0x07 and 0x08.
  • Reg 1 and Reg 2 read value from 0x07 and 0x08 register, so their value are 0x00000000 and 0x00000007.

3. Execute (EX)

(1) First level multiplexers choose value 0x0000000f and Reg 2. Due to the x7 register has changed by previous instruction but hasn't update to the register. Thus, multiplexers choose 0x0000000f come from previous instrction as x7 value.

(2) Reg 1 and Reg 2 are also send to branch block, but no branch is taken.

(3) Second level multiplexer choose value come from first level multiplexers as Op1 and Op2 of ALU.

(4) ALU will add two operand together, so the Res is equal to 0x00000016.

4. Memory access (MEM)

(1) Res from ALU is send to Addr in Data memory

(2) Use as data memory address. Memory read data at address 0x00000016, so Read out is equal to 0x2023ff41. We can check the table below to confirm the data section of memory.

(3) 0x00000016 Pass through this stage and go to WB stage

(4) 0x00000016 Send back to EXE stage for next instruction to use

  • Reg 2 is send to Data in, but memory doesn't enable writing.
5. Register write back (WB)

(1) The multiplexer choose Res from ALU as final output. So the output value is 0x00000016.

  • The output value and Wr idx 0x08 are send back to registers block. Finally, the value 0x00000016 will be write into x8 register, whose ABI name is s0.

After all these stage are done, the register is updated like this:

Reference