# Implement unsigned int mul by count leading zero
Contributed by [Yan-You Chen](https://github.com/y0y0alex/CA/tree/main/homework1)
We can know that the registers are usually 32bits, so I implement a 32bits * 32bits multiplier.
## Usually multiplier in C code
This is the easiest mul code.
We check the rightest bit of multiplier, if the rightest bit is 1, we add the multiplicand to the result.
And then, we shift left the multiplicand, do 32 times.
```c
#include <stdint.h>
#include <stdio.h>
#include <inttypes.h>
uint64_t int_mul(uint32_t A, uint32_t B) {
uint64_t result=0;
for (int i = 0; i < 32; i++) {
if ((A >> i) & 1) {
result += ((uint64_t)B << i);
}
}
return result;
}
int main(){
uint32_t A = 0x11111111;
uint32_t B = 0xffffdddd;
uint64_t result=0;
result = int_mul(A, B);
printf("uint64: %"PRIX64"\n", result);
return 0;
}
```
## Usually multiplier in assembly code
When I implement this assembly code, the first problem is that register only have 32 bits for you to store data. But 32 bits * 32 bits is 64 bits, so I use 2 registers to storage the data. Use a register to store high 32 bits, another to store low 32 bits.
```c
.data
data_1: .word 0x12345678
data_2: .word 0xffffdddd
.text
main:
lw s0, data_1 #s0 = A
lw s1, data_2 #s1 = B
li s2, 0 #s2: high 32 of number
li s3, 0 #s3: low 32 of number
li s4, 0 #used to check how many bit should shift
int_mul:
slti t1, s4, 32
beq t1, zero, exit
srl t0, s1, s4
andi t0, t0, 0x00000001 #check B's rightest bit
beq t0, zero, skip #if(rightest bit is zero) jump
sll s5,s0,s4 #s0 is A,S5 the low bit i want
li t2, 32
sub t2, t2, s4
srl s6, s0, t2 #s0 is A, S6 the high bit i want
add s7, s3, s5 #s7 is 32_low + low bit i want
jal overflow_detect_function
no_overflow:
add s2, s2, s6
jal skip
skip:
addi s4, s4 ,1
jal int_mul
overflow_detect_function:
sltu t3, s7, s3
mv s3, s7
beq t3, zero, no_overflow
# if not jump --> overflow
add s2, s2, s6
addi s2, s2, 1
addi s4, s4 ,1
jal int_mul
exit:
li a7,10
ecall
```
The important code is that the low 32 bits are the multiplicand sll X bit, and then the high 32 bits you want are the multiplicand srl (32-X) bits.
```c
sll s5,s0,s4 #s0 is A,S5 the low bit i want
li t2, 32
sub t2, t2, s4
srl s6, s0, t2 #s0 is A, S6 the high bit i want
add s7, s3, s5 #s7 is 32_low + low bit i want
```
And we need to pay attendtion to adding number to the low 32 bits. When we add the low 32 bits to the register which storage the finally data, that may cause overflow.
```c
# s7 is finally_32_low + low bit i want
# s3 is the finally_32_low
overflow_detect_function:
sltu t3, s7, s3
mv s3, s7
beq t3, zero, no_overflow
# if not jump --> overflow
add s2, s2, s6
addi s2, s2, 1
addi s4, s4 ,1
jal int_mul
```
So I use the "sltu" to check whether the adding of two unsigned int cause overflow. If adding answer of two number is smaller then the origin low 32 bits, that means overflow happen.
## CLZ code in C
This is base on the Quiz1-A problem, I make a little change for the code. I change the input x's bits.
Make 64 bits to 32 bits, so the mask and operation has some different.
```c
#include <stdint.h>
#include <stdio.h>
uint32_t CLZ_32(uint32_t x)
{
x |= (x >> 1);
x |= (x >> 2);
x |= (x >> 4);
x |= (x >> 8);
x |= (x >> 16);
x -= ((x >> 1) & 0x55555555);
x = ((x >> 2) & 0x33333333) + (x & 0x33333333);
x = ((x >> 4) + x) & 0x0f0f0f0f;
x += (x >> 8);
x += (x >> 16);
return (32 - (x & 0x3f));
}
```
## CLZ code in assembly
This used the most basic method to generate this assembly code.
```c
.data
mask_1: .word 0x55555555
mask_2: .word 0x33333333
mask_3: .word 0x0f0f0f0f
.text
CLZ:
#a1: the num(x) you want to count CLZ
#t0: shifted x
srli t0, a1, 1 # t0 = x>>1
or a1, a1, t1 # x |= x>>1
srli t0, a1, 2 # t0 = x>>2
or a1, a1, t0 # x |= x>>2
srli t0, a1, 4 # t0 = x>>4
or a1, a1, t0 # x |= x>>4
srli t0, a1, 8 # t0 = x>>8
or a1, a1, t0 # x |= x>>8
srli t0, a1, 16 # t0 = x>>16
or a1, a1, t0 # x |= x>>16
#start
lw t2, mask_1
srli t0, a1, 1 # t0 = x>>1
and t1, t0, t2 # t1 = (x>>1)&mask_1
sub a1, a1, t1 # x -= ((x>>1)&mask_1)
lw t2, mask_2 # load mask_2 to t2
srli t0, a1, 2 # t0 = x>>2
and t1, t0, t2 # (x>>2)&mask_2
and a1, a1, t2 # x & mask_2
add a1, t1, a1 # ((x>>2)&mask_2) + (x&mask_2)
srli t0, a1, 4 # t0 = x>>4
add a1, a1, t0 # x + (x>>4)
lw t2, mask_3 # load mask_3 to t2
and a1, a1, t2 # ((x>>4)+x)&mask_4
srli t0, a1, 8 # t0 = x>>8
add a1, a1, t0 # x += (x>>8)
srli t0, a1, 16 # t0 = x>>16
add a1, a1, t0 # x += (x>>16)
andi t0, a1, 0x3f # t0 = x&0x3f
li a1, 32 # a0 = 32
sub a1, a1, t0 # 32 - (x&0x3f)
```
# Use CLZ in multiplier
We can see the following example:
```c
A : 0xFFFFFFFF
B : 0x00000001
A * B is the answer I want
answer : 0x00000000FFFFFFFF
```
If you ues the usual bit multiplier, although left side bits of B have a lot of zero, it still need to do the all operation. If we skip those zero you don't need to do in the left. So this means we can save so many operation that multiplier do. There I discover the CLZ can reduce the useless operation.
## multiplier with CLZ in C code
This is also a shift bit multiplier, but we check the CLZ, to reduce the useless operation.
```c
#include <stdint.h>
#include <stdio.h>
uint16_t CLZ_32(uint32_t x)
{
x |= (x >> 1);
x |= (x >> 2);
x |= (x >> 4);
x |= (x >> 8);
x |= (x >> 16);
x -= ((x >> 1) & 0x55555555);
x = ((x >> 2) & 0x33333333) + (x & 0x33333333);
x = ((x >> 4) + x) & 0x0f0f0f0f;
x += (x >> 8);
x += (x >> 16);
return (32 - (x & 0x3f));
}
uint64_t efficient_int_mul(uint32_t A, uint32_t B) {
uint16_t n = CLZ_32(A);
uint16_t m = CLZ_32(B);
uint16_t result_bits;
if(n>m)
result_bits = n;
else{
result_bits = m;
uint32_t temp = A;
A = B;
B = temp;
}
uint64_t result = 0;
for (int i = 0; i < 32-result_bits; i++) {
if ((A >> i) & 1) {
result += ((uint64_t)B << i);
}
}
return result;
}
int main() {
uint32_t A = 0x12345678;
uint32_t B = 0xffffdddd;
uint64_t result = efficient_int_mul(A, B);
printf("uint64: %"PRIX64"\n", result);
return 0;
}
```
### compare origin and new with CLZ in C code
```
test data:
A: 0x 1234 5678
B: 0x ffff dddd
correct answer:
0x 1234 540A 8F5C 3D98
```
#### origin
answer:
<s>

</s>
:::warning
:warning: Don't put the screenshots which contain plain text only. Instead, utilize HackMD syntax to annotate the text.
:notes: jserv
:::
#### new with CLZ
answer:

Both answer are correct.
## multiplier with CLZ in assembly code
```c
.data
data_1: .word 0x12345678
data_2: .word 0xffffdddd
mask_1: .word 0x55555555
mask_2: .word 0x33333333
mask_3: .word 0x0f0f0f0f
.text
main:
lw s0, data_1 #s0 = A
lw s1, data_2 #s1 = B
mv a0, s0
jal ra, CLZ
mv t5, a0 # A's CLZ -> t5
mv a0, s1
jal ra, CLZ
mv t6, a0 # B's CLZ -> t6
slt t0, t5, t6 # if A's zero less than B's, t0=1
li a0, 32
bne t0, zero, start_mul
mv t0,s0
mv s0, s1
mv s1, t0
mv t6, t5
start_mul:
#reset
sub a0, a0, t6
li t0, 0
li t1, 0
li t2, 0
li s2, 0 #s2: high 32 of number
li s3, 0 #s3: low 32 of number
li s4, 0 #used to check how many bit should shift
int_mul:
slt t1, s4, a0
beq t1, zero, exit
srl t0, s1, s4
andi t0, t0, 0x00000001 #check B's rightest bit
beq t0, zero, skip #if(rightest bit is zero) jump
sll s5,s0,s4 #s0 is A,S5 the low bit i want
li t2, 32
sub t2, t2, s4
srl s6, s0, t2 #s0 is A, S6 the high bit i want
add s7, s3, s5 #s7 is 32_low + low bit i want
jal overflow_detect_function
no_overflow:
add s2, s2, s6
jal skip
skip:
addi s4, s4 ,1
jal int_mul
overflow_detect_function:
sltu t3, s7, s3
mv s3, s7
beq t3, zero, no_overflow
# if not jump --> overflow
add s2, s2, s6
addi s2, s2, 1
addi s4, s4 ,1
jal int_mul
CLZ:
#a0: the num(x) you want to count CLZ
#t0: shifted x
srli t0, a0, 1 # t0 = x>>1
or a0, a0, t0 # x |= x>>1
srli t0, a0, 2 # t0 = x>>2
or a0, a0, t0 # x |= x>>2
srli t0, a0, 4 # t0 = x>>4
or a0, a0, t0 # x |= x>>4
srli t0, a0, 8 # t0 = x>>8
or a0, a0, t0 # x |= x> 8
srli t0, a0, 16 # t0 = x>>16
or a0, a0, t0 # x |= x>>16
#start_mask
lw t2, mask_1
srli t0, a0, 1 # t0 = x>>1
and t1, t0, t2 # t1 = (x>>1)&mask1
sub a0, a0, t1 # x -= (x>>1)&mask1
lw t2, mask_2 # mask_2 to t2
srli t0, a0, 2 # t0 = x>>2
and t1, t0, t2 # (x>>2)&mask_2
and a0, a0, t2 # x = x&mask_2
add a0, t1, a0 # x = ((x>>2)&mask_2) + (x&mask_2)
srli t0, a0, 4 # t0 = x>>4
add a0, a0, t0 # x = x+(x>>4)
lw t2, mask_3 # mask_3 to t2
and a0, a0, t2 # ((x>>4) + x) & mask_3
srli t0, a0, 8 # t0 = x>>8
add a0, a0, t0 # x += x>>8
srli t0, a0, 16 # t0 = x>>16
add a0, a0, t0 # x += x>>16
andi t0, a0, 0x3f # t0 = x&0x3f
li a0, 32 # a0 = 32
sub a0, a0, t0 # 32 - (x & 0x3f)
ret
exit:
li a7,10
ecall
```
The different is not only add CLZ, I also check which varible have more leading zero. Take the more number of leading zero, can reduce more operations.
```
example:
(1) 0x 0000 0001
0x 1234 5678 ==> shift right this
------------
(2) 0x 1234 5678
0x 0000 0001 ==> shift right this
```
(2) is clearly have less shift right operation.
SO, if the type is like (1), we change the multiplier and multiplicand position.
```
# A's CLZ -> t5
# B's CLZ -> t6
slt t0, t5, t6 # if A's zero less than B's, t0=1
li a0, 32
bne t0, zero, start_mul
mv t0,s0
mv s0, s1
mv s1, t0
mv t6, t5
```
## Compare with old and new design in assembly code with Ripes
s2: high 32 of number
s3: low 32 of number
### **test data case1**
```c
A : 0x12345678
B : 0xffffdddd
answer : 0x1234540A8F5C3D98
```
#### origin
Ripes:

Final answer in register


#### new with CLZ
Ripes:



### **test data case2**
```c
A : 0x00000005
B : 0x00000007
answer : 0x0000000000000023
```
#### origin
Ripes:



#### new with CLZ
Ripes:


### **test data case3**
```c
A : 0xFFFF1111
B : 0x00000101
answer : 0x00000100FF102211
```
#### origin
Ripes:


#### new with CLZ
Ripes:


### SHOW?
We can see the multiplier with CLZ has less cycles than normaly.
:::warning
You shall use RDCYCLE/RDCYCLEH instruction for the statistics of your program’s execution.
:notes: jserv
:::
## **Analysis and Pipeline**
### The code is generated by Ripes
```c
00000000 <main>:
0: 10000417 auipc x8 0x10000
4: 00042403 lw x8 0 x8
8: 10000497 auipc x9 0x10000
c: ffc4a483 lw x9 -4 x9
10: 00040513 addi x10 x8 0
14: 0a4000ef jal x1 164 <CLZ>
18: 00050f13 addi x30 x10 0
1c: 00048513 addi x10 x9 0
20: 098000ef jal x1 152 <CLZ>
24: 00050f93 addi x31 x10 0
28: 01ff22b3 slt x5 x30 x31
2c: 02000513 addi x10 x0 32
30: 00029a63 bne x5 x0 20 <start_mul>
34: 00040293 addi x5 x8 0
38: 00048413 addi x8 x9 0
3c: 00028493 addi x9 x5 0
40: 000f0f93 addi x31 x30 0
00000044 <start_mul>:
44: 41f50533 sub x10 x10 x31
48: 00000293 addi x5 x0 0
4c: 00000313 addi x6 x0 0
50: 00000393 addi x7 x0 0
54: 00000913 addi x18 x0 0
58: 00000993 addi x19 x0 0
5c: 00000a13 addi x20 x0 0
00000060 <int_mul>:
60: 00aa2333 slt x6 x20 x10
64: 0c030e63 beq x6 x0 220 <exit>
68: 0144d2b3 srl x5 x9 x20
6c: 0012f293 andi x5 x5 1
70: 02028263 beq x5 x0 36 <skip>
74: 01441ab3 sll x21 x8 x20
78: 02000393 addi x7 x0 32
7c: 414383b3 sub x7 x7 x20
80: 00745b33 srl x22 x8 x7
84: 01598bb3 add x23 x19 x21
88: 014000ef jal x1 20 <overflow_detect_function>
0000008c <no_overflow>:
8c: 01690933 add x18 x18 x22
90: 004000ef jal x1 4 <skip>
00000094 <skip>:
94: 001a0a13 addi x20 x20 1
98: fc9ff0ef jal x1 -56 <int_mul>
0000009c <overflow_detect_function>:
9c: 013bbe33 sltu x28 x23 x19
a0: 000b8993 addi x19 x23 0
a4: fe0e04e3 beq x28 x0 -24 <no_overflow>
a8: 01690933 add x18 x18 x22
ac: 00190913 addi x18 x18 1
b0: 001a0a13 addi x20 x20 1
b4: fadff0ef jal x1 -84 <int_mul>
000000b8 <CLZ>:
b8: 00155293 srli x5 x10 1
bc: 00556533 or x10 x10 x5
c0: 00255293 srli x5 x10 2
c4: 00556533 or x10 x10 x5
c8: 00455293 srli x5 x10 4
cc: 00556533 or x10 x10 x5
d0: 00855293 srli x5 x10 8
d4: 00556533 or x10 x10 x5
d8: 01055293 srli x5 x10 16
dc: 00556533 or x10 x10 x5
e0: 10000397 auipc x7 0x10000
e4: f283a383 lw x7 -216 x7
e8: 00155293 srli x5 x10 1
ec: 0072f333 and x6 x5 x7
f0: 40650533 sub x10 x10 x6
f4: 10000397 auipc x7 0x10000
f8: f183a383 lw x7 -232 x7
fc: 00255293 srli x5 x10 2
100: 0072f333 and x6 x5 x7
104: 00757533 and x10 x10 x7
108: 00a30533 add x10 x6 x10
10c: 00455293 srli x5 x10 4
110: 00550533 add x10 x10 x5
114: 10000397 auipc x7 0x10000
118: efc3a383 lw x7 -260 x7
11c: 00757533 and x10 x10 x7
120: 00855293 srli x5 x10 8
124: 00550533 add x10 x10 x5
128: 01055293 srli x5 x10 16
12c: 00550533 add x10 x10 x5
130: 03f57293 andi x5 x10 63
134: 02000513 addi x10 x0 32
138: 40550533 sub x10 x10 x5
13c: 00008067 jalr x0 x1 0
00000140 <exit>:
140: 00a00893 addi x17 x0 10
144: 00000073 ecall
```
### IF stage
When ==PC== is ==0x00==, the instruction ==0: 10000417 auipc x8 0x10000== is in IF.
This instruction is ==auipc a0 0x10000==.

We can see that output of ==PC== is ==0x00000000==, which is the address of ==auipc== instruction. And hex instruction 0x10000417 is the output come from the compressed decoder.
### ID stage
When in this stage, the decoder will decode the instruction, and imm will output the immediate number ==(0x10000000)==. We can clear check the decoder's output ==AUIPC==.

### EXE stage
During EXE stage, ==auipc== instruction is need to add the address and imm, so we can see the ALU output is 0x10000000.
The number is ==0x10000000 + 0 = 0x10000000==

### MEM stage
Because the ==auipc== instruction is not the type which need to store or load data from memory, so the data just ==pass through== this MEM stage.
This instruction moment has some different, we can see ==the EXE stage instruction LW== need the ==x8== register, so we need to forwards the valuew of x8 to EXE stage. We can see the yellow line, the number is forwarded. Let the LW instruction to calculate ==0 + 0x10000000==

### WB stage
This stage we can see that we need to update the ==x8== register. So we can see the yellow line data and write back to register. We can see the register signal is 0x1.
