# Implement unsigned int mul by count leading zero Contributed by [Yan-You Chen](https://github.com/y0y0alex/CA/tree/main/homework1) We can know that the registers are usually 32bits, so I implement a 32bits * 32bits multiplier. ## Usually multiplier in C code This is the easiest mul code. We check the rightest bit of multiplier, if the rightest bit is 1, we add the multiplicand to the result. And then, we shift left the multiplicand, do 32 times. ```c #include <stdint.h> #include <stdio.h> #include <inttypes.h> uint64_t int_mul(uint32_t A, uint32_t B) { uint64_t result=0; for (int i = 0; i < 32; i++) { if ((A >> i) & 1) { result += ((uint64_t)B << i); } } return result; } int main(){ uint32_t A = 0x11111111; uint32_t B = 0xffffdddd; uint64_t result=0; result = int_mul(A, B); printf("uint64: %"PRIX64"\n", result); return 0; } ``` ## Usually multiplier in assembly code When I implement this assembly code, the first problem is that register only have 32 bits for you to store data. But 32 bits * 32 bits is 64 bits, so I use 2 registers to storage the data. Use a register to store high 32 bits, another to store low 32 bits. ```c .data data_1: .word 0x12345678 data_2: .word 0xffffdddd .text main: lw s0, data_1 #s0 = A lw s1, data_2 #s1 = B li s2, 0 #s2: high 32 of number li s3, 0 #s3: low 32 of number li s4, 0 #used to check how many bit should shift int_mul: slti t1, s4, 32 beq t1, zero, exit srl t0, s1, s4 andi t0, t0, 0x00000001 #check B's rightest bit beq t0, zero, skip #if(rightest bit is zero) jump sll s5,s0,s4 #s0 is A,S5 the low bit i want li t2, 32 sub t2, t2, s4 srl s6, s0, t2 #s0 is A, S6 the high bit i want add s7, s3, s5 #s7 is 32_low + low bit i want jal overflow_detect_function no_overflow: add s2, s2, s6 jal skip skip: addi s4, s4 ,1 jal int_mul overflow_detect_function: sltu t3, s7, s3 mv s3, s7 beq t3, zero, no_overflow # if not jump --> overflow add s2, s2, s6 addi s2, s2, 1 addi s4, s4 ,1 jal int_mul exit: li a7,10 ecall ``` The important code is that the low 32 bits are the multiplicand sll X bit, and then the high 32 bits you want are the multiplicand srl (32-X) bits. ```c sll s5,s0,s4 #s0 is A,S5 the low bit i want li t2, 32 sub t2, t2, s4 srl s6, s0, t2 #s0 is A, S6 the high bit i want add s7, s3, s5 #s7 is 32_low + low bit i want ``` And we need to pay attendtion to adding number to the low 32 bits. When we add the low 32 bits to the register which storage the finally data, that may cause overflow. ```c # s7 is finally_32_low + low bit i want # s3 is the finally_32_low overflow_detect_function: sltu t3, s7, s3 mv s3, s7 beq t3, zero, no_overflow # if not jump --> overflow add s2, s2, s6 addi s2, s2, 1 addi s4, s4 ,1 jal int_mul ``` So I use the "sltu" to check whether the adding of two unsigned int cause overflow. If adding answer of two number is smaller then the origin low 32 bits, that means overflow happen. ## CLZ code in C This is base on the Quiz1-A problem, I make a little change for the code. I change the input x's bits. Make 64 bits to 32 bits, so the mask and operation has some different. ```c #include <stdint.h> #include <stdio.h> uint32_t CLZ_32(uint32_t x) { x |= (x >> 1); x |= (x >> 2); x |= (x >> 4); x |= (x >> 8); x |= (x >> 16); x -= ((x >> 1) & 0x55555555); x = ((x >> 2) & 0x33333333) + (x & 0x33333333); x = ((x >> 4) + x) & 0x0f0f0f0f; x += (x >> 8); x += (x >> 16); return (32 - (x & 0x3f)); } ``` ## CLZ code in assembly This used the most basic method to generate this assembly code. ```c .data mask_1: .word 0x55555555 mask_2: .word 0x33333333 mask_3: .word 0x0f0f0f0f .text CLZ: #a1: the num(x) you want to count CLZ #t0: shifted x srli t0, a1, 1 # t0 = x>>1 or a1, a1, t1 # x |= x>>1 srli t0, a1, 2 # t0 = x>>2 or a1, a1, t0 # x |= x>>2 srli t0, a1, 4 # t0 = x>>4 or a1, a1, t0 # x |= x>>4 srli t0, a1, 8 # t0 = x>>8 or a1, a1, t0 # x |= x>>8 srli t0, a1, 16 # t0 = x>>16 or a1, a1, t0 # x |= x>>16 #start lw t2, mask_1 srli t0, a1, 1 # t0 = x>>1 and t1, t0, t2 # t1 = (x>>1)&mask_1 sub a1, a1, t1 # x -= ((x>>1)&mask_1) lw t2, mask_2 # load mask_2 to t2 srli t0, a1, 2 # t0 = x>>2 and t1, t0, t2 # (x>>2)&mask_2 and a1, a1, t2 # x & mask_2 add a1, t1, a1 # ((x>>2)&mask_2) + (x&mask_2) srli t0, a1, 4 # t0 = x>>4 add a1, a1, t0 # x + (x>>4) lw t2, mask_3 # load mask_3 to t2 and a1, a1, t2 # ((x>>4)+x)&mask_4 srli t0, a1, 8 # t0 = x>>8 add a1, a1, t0 # x += (x>>8) srli t0, a1, 16 # t0 = x>>16 add a1, a1, t0 # x += (x>>16) andi t0, a1, 0x3f # t0 = x&0x3f li a1, 32 # a0 = 32 sub a1, a1, t0 # 32 - (x&0x3f) ``` # Use CLZ in multiplier We can see the following example: ```c A : 0xFFFFFFFF B : 0x00000001 A * B is the answer I want answer : 0x00000000FFFFFFFF ``` If you ues the usual bit multiplier, although left side bits of B have a lot of zero, it still need to do the all operation. If we skip those zero you don't need to do in the left. So this means we can save so many operation that multiplier do. There I discover the CLZ can reduce the useless operation. ## multiplier with CLZ in C code This is also a shift bit multiplier, but we check the CLZ, to reduce the useless operation. ```c #include <stdint.h> #include <stdio.h> uint16_t CLZ_32(uint32_t x) { x |= (x >> 1); x |= (x >> 2); x |= (x >> 4); x |= (x >> 8); x |= (x >> 16); x -= ((x >> 1) & 0x55555555); x = ((x >> 2) & 0x33333333) + (x & 0x33333333); x = ((x >> 4) + x) & 0x0f0f0f0f; x += (x >> 8); x += (x >> 16); return (32 - (x & 0x3f)); } uint64_t efficient_int_mul(uint32_t A, uint32_t B) { uint16_t n = CLZ_32(A); uint16_t m = CLZ_32(B); uint16_t result_bits; if(n>m) result_bits = n; else{ result_bits = m; uint32_t temp = A; A = B; B = temp; } uint64_t result = 0; for (int i = 0; i < 32-result_bits; i++) { if ((A >> i) & 1) { result += ((uint64_t)B << i); } } return result; } int main() { uint32_t A = 0x12345678; uint32_t B = 0xffffdddd; uint64_t result = efficient_int_mul(A, B); printf("uint64: %"PRIX64"\n", result); return 0; } ``` ### compare origin and new with CLZ in C code ``` test data: A: 0x 1234 5678 B: 0x ffff dddd correct answer: 0x 1234 540A 8F5C 3D98 ``` #### origin answer: <s> ![ans1](https://hackmd.io/_uploads/SyopCrheT.png) </s> :::warning :warning: Don't put the screenshots which contain plain text only. Instead, utilize HackMD syntax to annotate the text. :notes: jserv ::: #### new with CLZ answer: ![ans2](https://hackmd.io/_uploads/BkXohrnxa.png) Both answer are correct. ## multiplier with CLZ in assembly code ```c .data data_1: .word 0x12345678 data_2: .word 0xffffdddd mask_1: .word 0x55555555 mask_2: .word 0x33333333 mask_3: .word 0x0f0f0f0f .text main: lw s0, data_1 #s0 = A lw s1, data_2 #s1 = B mv a0, s0 jal ra, CLZ mv t5, a0 # A's CLZ -> t5 mv a0, s1 jal ra, CLZ mv t6, a0 # B's CLZ -> t6 slt t0, t5, t6 # if A's zero less than B's, t0=1 li a0, 32 bne t0, zero, start_mul mv t0,s0 mv s0, s1 mv s1, t0 mv t6, t5 start_mul: #reset sub a0, a0, t6 li t0, 0 li t1, 0 li t2, 0 li s2, 0 #s2: high 32 of number li s3, 0 #s3: low 32 of number li s4, 0 #used to check how many bit should shift int_mul: slt t1, s4, a0 beq t1, zero, exit srl t0, s1, s4 andi t0, t0, 0x00000001 #check B's rightest bit beq t0, zero, skip #if(rightest bit is zero) jump sll s5,s0,s4 #s0 is A,S5 the low bit i want li t2, 32 sub t2, t2, s4 srl s6, s0, t2 #s0 is A, S6 the high bit i want add s7, s3, s5 #s7 is 32_low + low bit i want jal overflow_detect_function no_overflow: add s2, s2, s6 jal skip skip: addi s4, s4 ,1 jal int_mul overflow_detect_function: sltu t3, s7, s3 mv s3, s7 beq t3, zero, no_overflow # if not jump --> overflow add s2, s2, s6 addi s2, s2, 1 addi s4, s4 ,1 jal int_mul CLZ: #a0: the num(x) you want to count CLZ #t0: shifted x srli t0, a0, 1 # t0 = x>>1 or a0, a0, t0 # x |= x>>1 srli t0, a0, 2 # t0 = x>>2 or a0, a0, t0 # x |= x>>2 srli t0, a0, 4 # t0 = x>>4 or a0, a0, t0 # x |= x>>4 srli t0, a0, 8 # t0 = x>>8 or a0, a0, t0 # x |= x> 8 srli t0, a0, 16 # t0 = x>>16 or a0, a0, t0 # x |= x>>16 #start_mask lw t2, mask_1 srli t0, a0, 1 # t0 = x>>1 and t1, t0, t2 # t1 = (x>>1)&mask1 sub a0, a0, t1 # x -= (x>>1)&mask1 lw t2, mask_2 # mask_2 to t2 srli t0, a0, 2 # t0 = x>>2 and t1, t0, t2 # (x>>2)&mask_2 and a0, a0, t2 # x = x&mask_2 add a0, t1, a0 # x = ((x>>2)&mask_2) + (x&mask_2) srli t0, a0, 4 # t0 = x>>4 add a0, a0, t0 # x = x+(x>>4) lw t2, mask_3 # mask_3 to t2 and a0, a0, t2 # ((x>>4) + x) & mask_3 srli t0, a0, 8 # t0 = x>>8 add a0, a0, t0 # x += x>>8 srli t0, a0, 16 # t0 = x>>16 add a0, a0, t0 # x += x>>16 andi t0, a0, 0x3f # t0 = x&0x3f li a0, 32 # a0 = 32 sub a0, a0, t0 # 32 - (x & 0x3f) ret exit: li a7,10 ecall ``` The different is not only add CLZ, I also check which varible have more leading zero. Take the more number of leading zero, can reduce more operations. ``` example: (1) 0x 0000 0001 0x 1234 5678 ==> shift right this ------------ (2) 0x 1234 5678 0x 0000 0001 ==> shift right this ``` (2) is clearly have less shift right operation. SO, if the type is like (1), we change the multiplier and multiplicand position. ``` # A's CLZ -> t5 # B's CLZ -> t6 slt t0, t5, t6 # if A's zero less than B's, t0=1 li a0, 32 bne t0, zero, start_mul mv t0,s0 mv s0, s1 mv s1, t0 mv t6, t5 ``` ## Compare with old and new design in assembly code with Ripes s2: high 32 of number s3: low 32 of number ### **test data case1** ```c A : 0x12345678 B : 0xffffdddd answer : 0x1234540A8F5C3D98 ``` #### origin Ripes: ![](https://hackmd.io/_uploads/Syb3NLhl6.png) Final answer in register ![](https://hackmd.io/_uploads/B1C2HI2ga.png) ![](https://hackmd.io/_uploads/SyUgLU2xa.png) #### new with CLZ Ripes: ![](https://hackmd.io/_uploads/B1FpEL2ea.png) ![](https://hackmd.io/_uploads/B1C2HI2ga.png) ![](https://hackmd.io/_uploads/SyUgLU2xa.png) ### **test data case2** ```c A : 0x00000005 B : 0x00000007 answer : 0x0000000000000023 ``` #### origin Ripes: ![](https://hackmd.io/_uploads/r1ZGwU2xT.png) ![](https://hackmd.io/_uploads/HyGdPIhgT.png) ![](https://hackmd.io/_uploads/Syr5wLnxT.png) #### new with CLZ Ripes: ![](https://hackmd.io/_uploads/SyDgdU2x6.png) ![](https://hackmd.io/_uploads/r1SGu82x6.png) ### **test data case3** ```c A : 0xFFFF1111 B : 0x00000101 answer : 0x00000100FF102211 ``` #### origin Ripes: ![](https://hackmd.io/_uploads/S15n5U3e6.png) ![](https://hackmd.io/_uploads/ryL6c8nga.png) #### new with CLZ Ripes: ![](https://hackmd.io/_uploads/HJkzo82ga.png) ![](https://hackmd.io/_uploads/SyLMo82xT.png) ### SHOW? We can see the multiplier with CLZ has less cycles than normaly. :::warning You shall use RDCYCLE/RDCYCLEH instruction for the statistics of your program’s execution. :notes: jserv ::: ## **Analysis and Pipeline** ### The code is generated by Ripes ```c 00000000 <main>: 0: 10000417 auipc x8 0x10000 4: 00042403 lw x8 0 x8 8: 10000497 auipc x9 0x10000 c: ffc4a483 lw x9 -4 x9 10: 00040513 addi x10 x8 0 14: 0a4000ef jal x1 164 <CLZ> 18: 00050f13 addi x30 x10 0 1c: 00048513 addi x10 x9 0 20: 098000ef jal x1 152 <CLZ> 24: 00050f93 addi x31 x10 0 28: 01ff22b3 slt x5 x30 x31 2c: 02000513 addi x10 x0 32 30: 00029a63 bne x5 x0 20 <start_mul> 34: 00040293 addi x5 x8 0 38: 00048413 addi x8 x9 0 3c: 00028493 addi x9 x5 0 40: 000f0f93 addi x31 x30 0 00000044 <start_mul>: 44: 41f50533 sub x10 x10 x31 48: 00000293 addi x5 x0 0 4c: 00000313 addi x6 x0 0 50: 00000393 addi x7 x0 0 54: 00000913 addi x18 x0 0 58: 00000993 addi x19 x0 0 5c: 00000a13 addi x20 x0 0 00000060 <int_mul>: 60: 00aa2333 slt x6 x20 x10 64: 0c030e63 beq x6 x0 220 <exit> 68: 0144d2b3 srl x5 x9 x20 6c: 0012f293 andi x5 x5 1 70: 02028263 beq x5 x0 36 <skip> 74: 01441ab3 sll x21 x8 x20 78: 02000393 addi x7 x0 32 7c: 414383b3 sub x7 x7 x20 80: 00745b33 srl x22 x8 x7 84: 01598bb3 add x23 x19 x21 88: 014000ef jal x1 20 <overflow_detect_function> 0000008c <no_overflow>: 8c: 01690933 add x18 x18 x22 90: 004000ef jal x1 4 <skip> 00000094 <skip>: 94: 001a0a13 addi x20 x20 1 98: fc9ff0ef jal x1 -56 <int_mul> 0000009c <overflow_detect_function>: 9c: 013bbe33 sltu x28 x23 x19 a0: 000b8993 addi x19 x23 0 a4: fe0e04e3 beq x28 x0 -24 <no_overflow> a8: 01690933 add x18 x18 x22 ac: 00190913 addi x18 x18 1 b0: 001a0a13 addi x20 x20 1 b4: fadff0ef jal x1 -84 <int_mul> 000000b8 <CLZ>: b8: 00155293 srli x5 x10 1 bc: 00556533 or x10 x10 x5 c0: 00255293 srli x5 x10 2 c4: 00556533 or x10 x10 x5 c8: 00455293 srli x5 x10 4 cc: 00556533 or x10 x10 x5 d0: 00855293 srli x5 x10 8 d4: 00556533 or x10 x10 x5 d8: 01055293 srli x5 x10 16 dc: 00556533 or x10 x10 x5 e0: 10000397 auipc x7 0x10000 e4: f283a383 lw x7 -216 x7 e8: 00155293 srli x5 x10 1 ec: 0072f333 and x6 x5 x7 f0: 40650533 sub x10 x10 x6 f4: 10000397 auipc x7 0x10000 f8: f183a383 lw x7 -232 x7 fc: 00255293 srli x5 x10 2 100: 0072f333 and x6 x5 x7 104: 00757533 and x10 x10 x7 108: 00a30533 add x10 x6 x10 10c: 00455293 srli x5 x10 4 110: 00550533 add x10 x10 x5 114: 10000397 auipc x7 0x10000 118: efc3a383 lw x7 -260 x7 11c: 00757533 and x10 x10 x7 120: 00855293 srli x5 x10 8 124: 00550533 add x10 x10 x5 128: 01055293 srli x5 x10 16 12c: 00550533 add x10 x10 x5 130: 03f57293 andi x5 x10 63 134: 02000513 addi x10 x0 32 138: 40550533 sub x10 x10 x5 13c: 00008067 jalr x0 x1 0 00000140 <exit>: 140: 00a00893 addi x17 x0 10 144: 00000073 ecall ``` ### IF stage When ==PC== is ==0x00==, the instruction ==0: 10000417 auipc x8 0x10000== is in IF. This instruction is ==auipc a0 0x10000==. ![](https://hackmd.io/_uploads/BJZfNDnga.png) We can see that output of ==PC== is ==0x00000000==, which is the address of ==auipc== instruction. And hex instruction 0x10000417 is the output come from the compressed decoder. ### ID stage When in this stage, the decoder will decode the instruction, and imm will output the immediate number ==(0x10000000)==. We can clear check the decoder's output ==AUIPC==. ![](https://hackmd.io/_uploads/BJFfLD2la.png) ### EXE stage During EXE stage, ==auipc== instruction is need to add the address and imm, so we can see the ALU output is 0x10000000. The number is ==0x10000000 + 0 = 0x10000000== ![](https://hackmd.io/_uploads/B1t_aw2e6.png) ### MEM stage Because the ==auipc== instruction is not the type which need to store or load data from memory, so the data just ==pass through== this MEM stage. This instruction moment has some different, we can see ==the EXE stage instruction LW== need the ==x8== register, so we need to forwards the valuew of x8 to EXE stage. We can see the yellow line, the number is forwarded. Let the LW instruction to calculate ==0 + 0x10000000== ![](https://hackmd.io/_uploads/rke-qO2l6.png) ### WB stage This stage we can see that we need to update the ==x8== register. So we can see the yellow line data and write back to register. We can see the register signal is 0x1. ![](https://hackmd.io/_uploads/HJK0FOnl6.png)