contributed by < ChengChiTing
>
RISC-V
, jserv
The Collatz conjecture is one of the most famous unsolved problems in mathematics. It was posed by L. Collatz in 1937, also called the "3x+1 mapping", "3n+1 problem", "Hasse's algorithm", "Kakutani's problem", "Syracuse algorithm", "Syracuse problem", "Thwaites conjecture", and "Ulam's problem".
The conjecture asks whether repeating two simple arithmetic operations will eventually transform every "positive integer" into "1". It concerns sequences of integers in which each term is obtained from the previous term as follows:
(1) if the previous term is even, the next term is one half of the previous term (N = N/2).
(2) If the previous term is odd, the next term is 3 times the previous term plus 1 (N = 3N +1).
Example: if we choose number "7".
7 -> 22 -> 11 -> 34 -> 17 -> 52 -> 26 -> 13 -> 40 -> 20 -> 10 -> 5 -> 16 -> 8 -> 4 -> 2 -> 1
It needs 16 steps to reach "1".
The conjecture is that these sequences always reach 1, no matter which positive integer is chosen to start the sequence.
You can find the source code here. Feel free to fork and modify it.
When we input a test number"N", the function can calcute how many steps need for N to reach "1". There are two situation have to deal with, N = 0 and N = 1.
If N = 0, the function will exit and print "The number is zero".
If N = 1, it is defined as needing zero step to reach 1.
The loop has to confirm N is even or odd, first. After that, we can use right shift to make N/2 when N is even. If N is odd, we can use left shift to make 2N. Then plus previous N and 1 to get 3N + 1. The code write in C and RISC-V Assembly Language are shown below.
The code below is an implementation of Collatz conjecture.
#include <stdio.h>
#include <stdint.h>
#define data 7 // number N
// count_trailing_zeros function
uint16_t ctz(uint32_t x)
{
x |= (x << 1);
x |= (x << 2);
x |= (x << 4);
x |= (x << 8);
x |= (x << 16);
// count ones (population count)
x -= ((x >> 1) & 0x5555555555555555);
x = ((x >> 2) & 0x3333333333333333) + (x & 0x3333333333333333);
x = ((x >> 4) + x) & 0x0f0f0f0f0f0f0f0f;
x += (x >> 8);
x += (x >> 16);
return (32 - (x & 0x7f));
}
// count how many steps for finishing Collatz conjecture
uint16_t collatz(uint32_t x){
if(x == 0){
printf("The number is zero \n");
return 0;
}
uint16_t result = 0;
uint16_t y = 0;
while(x != 1){
if((x & 0x01) == 1){
x += ((x << 1) + 1);
result ++;
}
else{
y = ctz(x);
x = x >> y;
result += y;
}
}
printf("The number of steps for Collatz conjecture is %u", result);
return result;
}
int main (){
uint32_t x = data;
x = collatz(x);
return 0;
}
The above C code can be separated into two parts, count_trailing_zeros and Collatz conjecture.
The count_trailing_zeros function(ctz) is derived from Quiz 1 count_leading_zeros function(clz).It can calcute the trailing zero of a 32-bit integers. A trailing zero is defined as any ‘0’ digit that appears after the last non-zero digit in the binary representation of a number.
Example: Input : 6720 -> Output : 6.
Explanation: As Binary of 6720 = (00000000000000000001101001000000)
There are 6 zeros appear after the last non-zero digit
The original C code in Quiz 1 Problem A is below.
#include <stdint.h>
uint16_t count_leading_zeros(uint64_t x)
{
x |= (x >> 1);
x |= (x >> 2);
x |= (x >> 4);
x |= (x >> 8);
x |= (x >> 16);
x |= (x >> 32);
// count ones (population count)
x -= ((x >> 1) & 0x5555555555555555);
x = ((x >> 2) & 0x3333333333333333) + (x & 0x3333333333333333);
x = ((x >> 4) + x) & 0x0f0f0f0f0f0f0f0f;
x += (x >> 8);
x += (x >> 16);
x += (x >> 16);
return (64 - (x & 0x7f));
}
The above code complete two tasks. First, change all the bit on the right side of the first non-zero bit into "1", and left side of the first non-zero bit into "0". Second, implement population count to get the number of "1" in the 64-bit binary. Thus, if we subtract the number of "1" from 64, we can get the number of leading zero.
Now, we rewrite the above code and change it into 32-bit version.
x |= (x << 1);
x |= (x << 2);
x |= (x << 4);
x |= (x << 8);
x |= (x << 16);
After rewrite the code, we can change all the bit on the right side of the last non-zero bit into "1", and left side of the last non-zero bit into "0" and then we also implement population to get number of "1".Thus, we can get the number of trailing zero
The first thing we have to do is check input number N is zero or not. If N is not zero we can enter the loop and then we need to confirm N is even or odd.
If N is odd we left shift one bit of N and plus 1. Add result to the original N then we can get "3N + 1".
If N is even we right shift one bit of N to get "N/2".In the code above, if the factor of N is power of 2 we can use ctz to count the trailing zero and and make right shift once time.
Example: Input : 560 -> Output : 35.
Explanation: As Binary of 560 = (00000000000000000000001000110000)
After right shift = (00000000000000000000000000100011)
The counter will plus 1 if N is odd, and plus the number of trailing zero when N is even.
Finally, when N is 1, the program will exit and print how many steps need for N converting into 1.
.data
data0: .word 0 # N = 0
data1: .word 1 # 0 steps
data2: .word 7 # 16 steps
data3: .word 1024 # 10 steps
data4: .word 27 # 111 steps
str1: .string "The number is zero"
str2: .string "The number of steps for Collatz conjecture is "
.text
# s1 : test data
# a0 : the input number of collatz function
# a1 : the steps need for finishing collatz conjecture
# t0 : the input of ctz function
# a5 : the return value of ctzfunction
# ra : return address
main:
lw s1, data2
mv a0, s1
jal ra, collatz # call collatz function
li a7, 10 # exit program
ecall
collatz:
addi sp, sp, -4 # push s0
sw ra, 0(sp) # store ra into s0
add a1, zero, zero # result(a1) = 0
addi a2, zero, 1 # a2 = 0x01
bne zero, a0, loop # confirm a0 is 0 or not
mv t3, a0 # printf("The number is zero \n")
la a0, str1
li a7, 4
ecall
mv a0, t3
lw ra, 0(sp)
addi sp, sp, 4 # return s0
jr ra # end function
loop:
beq a0, a2, end # if number = 1, end the function
and a3, a0, a2 # confirm the number is even or odd
bne a3, a2, even
slli a3, a0, 1 # ood execute 3n + 1
addi a3, a3, 1
add a0, a3, a0
addi a1, a1, 1
j loop
even:
mv t0, a0 # even number execute n / 2
jal ra, ctz # call ctz function
srl a0, a0, a5 # remove trailing zero
add a1, a1, a5
j loop
end:
mv t3, a0 # printf("The number of steps for Collatz conjecture is ");
la a0, str2
li a7, 4
ecall
mv a0, a1
li a7, 1
ecall
mv a0, t3
lw ra, 0(sp)
addi sp, sp, 4 # return s0
jr ra # end function
ctz: # counting trailing zero
slli t1, t0, 1 # x |= (x << 1);
or t0, t0, t1
slli t1, t0, 2 # x |= (x << 2);
or t0, t0, t1
slli t1, t0, 4 # x |= (x << 4);
or t0, t0, t1
slli t1, t0, 8 # x |= (x << 8);
or t0, t0, t1
slli t1, t0, 16 # x |= (x << 16);
or t0, t0, t1
# count ones (population count)
li t1, 0x55555555 # 0x55555555
srli t2, t0, 1 # x >> 1
and t2, t1, t2 # (x >> 1) & 0x55555555
sub t0, t0, t2 # x -= ((x >> 1) & 0x55555555)
li t1, 0x33333333 # 0x33333333
srli t2, t0, 2 # x >> 2
and t2, t1, t2 # (x >> 2) & 0x33333333
and t0, t0, t1 # x & 0x33333333
add t0, t0, t2 # x = ((x >> 2) & 0x33333333) + (x & 0x33333333)
li t1, 0x0f0f0f0f # 0x0f0f0f0f
srli t2, t0, 4 # x >> 4
add t0, t2, t0 # (x >> 4) + x
and t0, t0, t1 # x = ((x >> 4) + x) & 0x0f0f0f0f
srli t2, t0, 8 # x >> 8
add t0, t0, t2 # x += (x >> 8)
srli t2, t0, 16 # x >> 16
add t0, t0, t2 # x += (x >> 16)
addi t1, zero,32
andi t0, t0, 0x7f # x & 0x7f
sub a5, t1, t0
jr ra
Rewrite the count trailing zero function in byte shift. This method is similar to binary search. It works by repeatedly dividing in half the portion of the list that could contain the item, until We find the item we want .
We make left shift first to check the last non-zero is on the left side or reght side. Then we make shift again and check the position of the last non-zero repeatedly until we find it. The counter will counter the trailing zeros during shifting.
The program costs log2(32) = 5 times for counting trailing zero.
The C code below is ctz in byte shift version.
uint16_t ctz(uint32_t x) // count_trailing_zeros function
{
if (x == 0) return 32; // if x = 0, return 32 immediately
int n = 1; // n is the number of trailing zeros
if ((x << 16) == 0) {
n += 16;
x >>= 16;
}
if ((x << 24) == 0) {
n += 8;
x >>= 8;
}
if ((x << 28) == 0) {
n += 4;
x >>= 4;
}
if ((x << 30) == 0) {
n += 2;
x >>= 2;
}
n = n - (x & 1);
return n;
}
The Assembly code below is an implementation of Collatz conjecture in byte shift version.
I rewrite some part of code, and use stack to reduce the using of registers.
.data
data0: .word 0 # N = 0
data1: .word 1 # 0 steps
data2: .word 7 # 16 steps
data3: .word 1024 # 10 steps
data4: .word 27 # 111 steps
str1: .string "The number is zero"
str2: .string "The number of steps for Collatz conjecture is "
.text
# s0 : test data
# s1 : the steps need for finishing collatz conjecture
# ra : return address
main:
lw s0, data2
jal ra, collatz # call collatz function
li a7, 10 # exit program
ecall
collatz:
addi sp, sp, -12
sw ra, 0(sp) # store ra into 0(sp)
sw s0, 4(sp) # store s0 into 4(sp)
add s1, zero, zero # result(s1) = 0
addi t0, zero, 1 # t0 = 1
bne zero, s0, loop # confirm a0 is 0 or not
la a0, str1
li a7, 4
ecall
addi sp, sp, 12
jr ra # end function
loop:
beq s0, t0, end # if number = 1, end the function
andi t1, s0, 1 # confirm the number is even or odd
bne t1, t0, even
slli t2, s0, 1 # ood execute 3n + 1
addi t2, t2, 1
add s0, t2, s0
addi s1, s1, 1
j loop
even:
jal ra, ctz # call ctz function
srl s0, s0, t1 # remove trailing zero
add s1, s1, t1
j loop
end:
la a0, str2 # printf("The number of steps for Collatz conjecture is ")
li a7, 4
ecall
mv a0, s1
li a7, 1
ecall
lw ra, 0(sp) # return ra
lw s0, 4(sp) # return s0
addi sp, sp, 12 # return stack
jr ra # end function
ctz: # counting trailing zero
sw s0, 8(sp) # store s0 into 8(sp)
addi t1, zero, 1
slli t2, s0, 16
bne t2, zero, bs24
addi t1, t1, 16
srli s0, s0, 16
bs24:
slli t2, s0, 24
bne t2, zero, bs28
addi t1, t1, 8
srli s0, s0, 8
bs28:
slli t2, s0, 28
bne t2, zero, bs30
addi t1, t1, 4
srli s0, s0, 4
bs30:
slli t2, s0, 30
bne t2, zero, bs31
addi t1, t1, 2
srli s0, s0, 2
bs31:
andi s0, s0, 1
sub t1, t1, s0
lw s0, 8(sp) # return s0 into 8(sp)
jr ra
We test our code using Ripes simulator.
Put assembly code above into the editor of Ripes and we will see Ripe doesn't execute it literally. Instead, it replace pseudo instruction (p.110) into equivalent one, and change register name from ABI name to Register name.
The translated code of Version 2 by Ripes looks like below :
00000000 <main>:
0: 10000417 auipc x8 0x10000
4: 00842403 lw x8 8 x8
8: 00c000ef jal x1 12 <collatz>
c: 00a00893 addi x17 x0 10
10: 00000073 ecall
00000014 <collatz>:
14: ff410113 addi x2 x2 -12
18: 00112023 sw x1 0 x2
1c: 00812223 sw x8 4 x2
20: 000004b3 add x9 x0 x0
24: 00100293 addi x5 x0 1
28: 00801e63 bne x0 x8 28 <loop>
2c: 10000517 auipc x10 0x10000
30: fe850513 addi x10 x10 -24
34: 00400893 addi x17 x0 4
38: 00000073 ecall
3c: 00c10113 addi x2 x2 12
40: 00008067 jalr x0 x1 0
00000044 <loop>:
44: 02540863 beq x8 x5 48 <end>
48: 00147313 andi x6 x8 1
4c: 00531c63 bne x6 x5 24 <even>
50: 00141393 slli x7 x8 1
54: 00138393 addi x7 x7 1
58: 00838433 add x8 x7 x8
5c: 00148493 addi x9 x9 1
60: fe5ff06f jal x0 -28 <loop>
00000064 <even>:
64: 03c000ef jal x1 60 <ctz>
68: 00645433 srl x8 x8 x6
6c: 006484b3 add x9 x9 x6
70: fd5ff06f jal x0 -44 <loop>
00000074 <end>:
74: 10000517 auipc x10 0x10000
78: fb350513 addi x10 x10 -77
7c: 00400893 addi x17 x0 4
80: 00000073 ecall
84: 00048513 addi x10 x9 0
88: 00100893 addi x17 x0 1
8c: 00000073 ecall
90: 00012083 lw x1 0 x2
94: 00412403 lw x8 4 x2
98: 00c10113 addi x2 x2 12
9c: 00008067 jalr x0 x1 0
000000a0 <ctz>:
a0: 00812423 sw x8 8 x2
a4: 00100313 addi x6 x0 1
a8: 01041393 slli x7 x8 16
ac: 00039663 bne x7 x0 12 <bs24>
b0: 01030313 addi x6 x6 16
b4: 01045413 srli x8 x8 16
000000b8 <bs24>:
b8: 01841393 slli x7 x8 24
bc: 00039663 bne x7 x0 12 <bs28>
c0: 00830313 addi x6 x6 8
c4: 00845413 srli x8 x8 8
000000c8 <bs28>:
c8: 01c41393 slli x7 x8 28
cc: 00039663 bne x7 x0 12 <bs30>
d0: 00430313 addi x6 x6 4
d4: 00445413 srli x8 x8 4
000000d8 <bs30>:
d8: 01e41393 slli x7 x8 30
dc: 00039663 bne x7 x0 12 <bs31>
e0: 00230313 addi x6 x6 2
e4: 00245413 srli x8 x8 2
000000e8 <bs31>:
e8: 00147413 andi x8 x8 1
ec: 40830333 sub x6 x6 x8
f0: 00812403 lw x8 8 x2
f4: 00008067 jalr x0 x1 0
In each row it denotes address in instruction memory, instruction's machine code (in hex) and instruction itself respectively.
We test five different test data (0, 1, 7, 27, 1024) respectively, and the results are shown below:
Test Data | Output Result |
---|---|
0 | ![]() |
1 | ![]() |
7 | ![]() |
27 | ![]() |
1024 | ![]() |
7 = {7, 22, 11, 34, 17, 52, 26, 13, 40, 20, 10, 5, 16, 8, 4, 2, 1}
27 = {27, 82, 41, 124, 62, 31, 94, 47, 142, 71, 214, 107, 322, 161, 484, 242, 121, 364, 182, 91, 274, 137, 412, 206, 103, 310, 155, 466, 233, 700, 350, 175, 526, 263, 790, 395, 1186, 593, 1780, 890, 445, 1336, 668, 334, 167, 502, 251, 754, 377, 1132, 566, 283, 850, 425, 1276, 638, 319, 958, 479, 1438, 719, 2158, 1079, 3238, 1619, 4858, 2429, 7288, 3644, 1822, 911, 2734, 1367, 4102, 2051, 6154, 3077, 9232, 4616, 2308, 1154, 577, 1732, 866, 433, 1300, 650, 325, 976, 488, 244, 122, 61, 184, 92, 46, 23, 70, 35, 106, 53, 160, 80, 40, 20, 10, 5, 16, 8, 4, 2, 1}
1024 = {1024, 512, 256, 128, 64, 32, 16, 8, 4, 2, 1}
Accoding to the table above, the code can execute in Ripes and print the answer correctly. Furthermore, if we execute C code above in C compiler we can also get same output.
Comparing to Assembly code verision 1, version 2 use byte shift to accelerate trailing zero counting. After optimization, version 2 use less cycles to finish the Collatz conjecture program.
The cycles used by version 1 and version 2 are shown below:
Test Data | Version 1 | version 2 |
---|---|---|
0 | ![]() |
![]() |
1 | ![]() |
![]() |
7 | ![]() |
![]() |
27 | ![]() |
![]() |
1024 | ![]() |
![]() |
Accoding to the table above, we find version 2 cost less clcyes in every data, but this phenomenon isn't obvious in "0" and "1". The possible reasons may be they didn't enter the loop. Furthermore, we can notice when the number of steps increase, the difference between the two version is greater(e.g. N = 7, N = 27). If we use the power of 2 , like 1024 to run program, we can get the same result.
It can be seen that byte shift ctz can optimize Collatz conjecture function and make program be more efficient.
Ripes provide four kinds of processor for us to choose, including single cycle processor, 5-stage processor, 5-stage with hazard detection and 5-stage with forward and hazard detection. Here we choose the 5 stage processor. Its block diagram look like this:
The "5-stage" means this processor using five-stage pipeline to parallelize instructions. The simplicity of operations performed allows every instruction to be completed in one processor cycle.The 5-stages are:
(1) Instruction fetch (IF):
In the Fetch stage, instruction is being fetched from the memory.
(2) Instruction decode and register fetch (ID):
During the Decode stage, we decode the instruction and fetch the source operands
(3) Execute (EX):
During the execute stage, the computer performs the operation specified by the instruction
(4) Memory access (MEM):
If there is any data that needs to be accessed, it is done in the memory stage
(5) Register write back (WB):
If we need to store the result in the destination location, it is done during the writeback stage
You can see that each stage is separated by pipeline registers ( the rectangle block with stage names on its each side IFID、IDEX、EX/MEM、MEM/WB ) in the block diagram.
Instruction in different type of format will go through 5 stages with different signal turned on. We can check the Pipeline diagram generated by Ripes below.
Let's discuss U-type and R-type format in detail with example.
We can use the first instruction in this program auipc x8 0x10000
to demonstrate U-type instruction. According to RISC-V Manual (p.14):
AUIPC (add upper immediate to pc) is used to build pc-relative addresses and uses the U-type format. AUIPC forms a 32-bit offset from the 20-bit U-immediate, filling in the lowest 12 bits with zeros, adds this offset to the pc, then places the result in register rd.
Let's see how it go through each stage.
(1) We start from instruction put at 0x00000000
, so input 0x00000000
intoaddr
(2) The machine code of first instruction is 0x10000417
(you can look it up in the code snippet above), so instr
is equal to 0x10000417
.
(3) PC will increment by 4 automatically using the above adder.
(1) Instruction 0x10000417
input to the decode and will be decoded
(2) Decode output opcode
= auipc
(3) Decode output Wr idx
= 0x08
(4) AUIPC
and 0x10000417
input into imm.
, then imm.
= 0x10000000
(0x10000
in upper 20 bits, filling in the lowest 12 bits with zeros)
R1 idx
and R2 idx
are both 0x00
.Reg 1
and Reg 2
read value from 0x00
register, so their value are both 0x00000000
too.(1) First level multiplexers choose value come from Reg 1
and Reg 2
, but this is an U-type format instruction, we don't use them. So they are filtered by second level multiplexer.
(2) Reg 1
and Reg 2
are also send to branch block, but no branch is taken.
(3) Second level multiplexer choose value come from current PC value 0x00000000
and immediate 0x10000000
as Op1
and Op2
of ALU.
(4) ALU will add two operand together, so the Res
is equal to 0x10000000
.
(1) Res
from ALU is send to Addr
in Data memory
(2) Use as data memory address. Memory read data at address 0x10000000
, so Read out
is equal to 0x00000000
. We can check the table below to confirm the data section of memory.
(3) 0x10000000
Pass through this stage and go to WB stage
(4) 0x10000000
Send back to EXE stage for next instruction to use
Reg 2
is send to Data in
, but memory doesn't enable writing.(1) The multiplexer choose Res
from ALU as final output. So the output value is 0x10000000
.
Wr idx 0x08
are send back to registers block. Finally, the value 0x10000000
will be write into x8
register, whose ABI name is s0
.After all these stage are done, the register is updated like this:
We can use the instruction in 0x00000058
(add x8 x7 x8
)to demonstrate R-type instruction. According to RISC-V Manual (p.15):
ADD and SUB perform addition and subtraction respectively. Overflows are ignored and the low XLEN bits of results are written to the destination.
Let's see how it go through each stage.
(1) PC output daaress at 0x00000058
, so input 0x00000058
intoaddr
(2) The instruction at address 0x00000058
is 0x00838433
(you can look it up in the code snippet above), so instr
is equal to 0x00838433
.
(3) PC will increment by 4 automatically using the above adder.
(1) Instruction 0x00838433
input to the decode and will be decoded
(2) Decode output opcode
= ADD
(3) Decode output Wr idx
= 0x08
(4) 0x00838433
and ADD
input into imm.
, then imm.
= 0xdeadbeef
R1 idx
and R2 idx
are 0x07
and 0x08
.Reg 1
and Reg 2
read value from 0x07
and 0x08
register, so their value are 0x00000000
and 0x00000007
.(1) First level multiplexers choose value 0x0000000f
and Reg 2
. Due to the x7 register has changed by previous instruction but hasn't update to the register. Thus, multiplexers choose 0x0000000f
come from previous instrction as x7 value.
(2) Reg 1
and Reg 2
are also send to branch block, but no branch is taken.
(3) Second level multiplexer choose value come from first level multiplexers as Op1
and Op2
of ALU.
(4) ALU will add two operand together, so the Res
is equal to 0x00000016
.
(1) Res
from ALU is send to Addr
in Data memory
(2) Use as data memory address. Memory read data at address 0x00000016
, so Read out
is equal to 0x2023ff41
. We can check the table below to confirm the data section of memory.
(3) 0x00000016
Pass through this stage and go to WB stage
(4) 0x00000016
Send back to EXE stage for next instruction to use
Reg 2
is send to Data in
, but memory doesn't enable writing.(1) The multiplexer choose Res
from ALU as final output. So the output value is 0x00000016
.
Wr idx 0x08
are send back to registers block. Finally, the value 0x00000016
will be write into x8
register, whose ABI name is s0
.After all these stage are done, the register is updated like this: