Generate bitmask with CLZ

# Generate bitmask by CLZ contributed by <[yuchen0620](https://github.com/yuchen0620)> [Bitmask](https://en.wikipedia.org/wiki/Mask_(computing)) is common used in computer science field. Using a mask, multiple bits in a byte, [nibble](https://en.wikipedia.org/wiki/Nibble), word, etc. can be set either on or off, or inverted from on to off (or vice versa) in a single bitwise operation. Hence, I use Count Leading Zeros(CLZ) to generate bitmask!With CLZ, we can generate bitmask which has the same bit num with the input, or we can do the operation to mask on only half bits of the input. Example: Masking on the higher nibble (bits 4, 5, 6, 7) while leaving the lower nibble (bits 0, 1, 2, 3) unchanged. ``` 10010101 10100101 OR 11110000 11110000 = 11110101 11110101 ``` ## C code This is 32bit version, the function **generate_bitmask** will call the function **count_leading_zeros**, and return the bitmask which is full of 1 and has the same bit with the input data. ```c=1 #include <stdio.h> #include <stdint.h> uint16_t count_leading_zeros(uint32_t x) { x |= (x >> 1); x |= (x >> 2); x |= (x >> 4); x |= (x >> 8); x |= (x >> 16); /* count ones (population count) */ x -= ((x >> 1) & 0x55555555 ); x = ((x >> 2) & 0x33333333) + (x & 0x33333333 ); x = ((x >> 4) + x) & 0x0f0f0f0f; x += (x >> 8); x += (x >> 16); return (32 - (x & 0x3f)); } uint32_t generate_bitmask(uint32_t x) { uint16_t leading_zeros = count_leading_zeros(x); if (leading_zeros==0) return 0xffffffff; return (1 << (32 - leading_zeros)) - 1 ; } int main() { uint32_t test_data[] = {0, 4, 0x80000000}; for (int i = 0; i < 3; i++){ printf("The reslut of %x is %x\n", test_data[i],generate_bitmask(test_data[i])); } } ``` ![](https://hackmd.io/_uploads/HkYob6-WT.png) We can generate the different bitmask by only changing the return statement in the **generate_bitmask**! Example: Generate the bitmask that set the left half of input into 0 ```c uint32_t generate_bitmask(uint32_t x) { uint16_t leading_zeros = count_leading_zeros(x); if (leading_zeros==0) return 0xffffffff; return (1 << ((32 - leading_zeros)/2)) - 1; } ``` If input x = **0b10110011**, then **leading_zeros** = 24, we will get the return (1 << (32 - 24) / 2) - 1 = **0b1111**, and use **and** operation to set the left half bit to 0. ``` 10110011 and 00001111 00000011 ``` ## **uint32_t problem** When the leading_zero is 0 (i.e. input is bigger than 0x80000000), the return becomes (1 << 32 ) - 1, equal to 0 - 1 = 0, because uint32_t doesn't have negative numbers, hence we need a condition statement **if (leading_zeros==0) return 0xffffffff** to handle it, but in the assembly code 1<<32 - 1 will return 0xffffffff which is what we want, therefore we can skip the condition statement in the assembly code. ```c uint32_t generate_bitmask(uint32_t x) { uint16_t leading_zeros = count_leading_zeros(x); if (leading_zeros==0) return 0xffffffff; return (1 << (32 - leading_zeros)) - 1 ; } ``` ## Assembly code My program starts from **main**, next will enter **loop** and jump to **generateBitmask**, return address will store in **ra** register, the first thing we jump to **generateBitmask** is to store ra in the stack, because **generateBitmask** will call **clz** and **ra** needs to store the address that jump back to **generateBitmask** when finish **clz**, **clz** will store result in **a0**, after back to **generateMask**, **generateMask** will compute the result and load the return address from stack to go back to **loop**, **loop** will load the input again for **printResult** and jump to **printResult**, after printing the result, the program will back to **loop** to do loop control, checking the test finish or not, if not it will continue the procedure above until finish all the test data. ```c .data test: .word 0,4,0x80000000 str1: .string "The reslut of " str2: .string " is " str3: .string "\n" .text main: la a2, test li t3, 3 loop: lw a0, 0(a2) jal ra, generateBitmask # Print the result to console mv a1, a0 lw a0, 0(a2) jal ra, printResult # Loop control addi a2, a2, 4 addi t3, t3, -1 bnez t3,loop # Exit program li a7, 10 ecall generateBitmask: addi sp, sp, -4 sw ra, 0(sp) jal ra, clz # return (1 << (32 - leading_zeros)) - 1; li t0, 32 sub t0, t0, a0 li t1, 1 sll a0, t1, t0 addi a0, a0, -1 lw ra,0(sp) addi sp, sp,4 jr ra clz: # x |= (x>>1) srli t0, a0, 1 or a0, t0, a0 # x |= (x>>2) srli t0, a0, 2 or a0, t0, a0 # x |= (x>>4) srli t0, a0, 4 or a0, t0, a0 # x |= (x>>8) srli t0, a0, 8 or a0, t0, a0 # x |= (x>>16) srli t0, a0, 16 or a0, t0, a0 # x -= ((x >> 1) & 0x55555555) srli t0, a0, 1 li t1, 0x55555555 and t0, t0, t1 sub a0, a0, t0 # x = ((x >> 2) & 0x33333333) + (x & 0x33333333) srli t0, a0, 2 li t1, 0x33333333 and t0, t0, t1 and t2, a0, t1 add a0, t0, t2 # x = ((x >> 4) + x) & 0x0f0f0f0f srli t0, a0, 4 add t0, t0, a0 li t1, 0x0f0f0f0f and a0, t0, t1 # x += (x >> 8); srli t0, a0, 8 add a0, a0, t0 # x += (x >> 16); srli t0, a0, 16 add a0, a0, t0 # return (32 - (x & 0x3f)) li t0, 0x3f and a0, a0, t0 li t0, 32 sub a0, t0, a0 ret # a0: input value # a1: bitmask printResult: mv t0, a0 mv t1, a1 la a0, str1 li a7, 4 ecall mv a0, t0 li a7, 35 ecall la a0, str2 li a7, 4 ecall mv a0, t1 li a7, 35 ecall la a0, str3 li a7, 4 ecall ret ``` ## Console ![](https://hackmd.io/_uploads/rkhESFg-T.png) ## Pipeline stage When I were writing my code, the bug I face the most is about return address, so I will focus on the **jal** instruction in the pipeline stage! ### IF stage In the **IF** stage, the processor fetch instruction to be executed from memory, besides, it will send the address from the program counter (PC) to memory to get the next instruction's content. In this case, the instruction **jal ra, generateBitmask** is at the 0x00000010 in the memory, the **PC** will add 4 to get next instruction. Besides, the **x1** register stores **ra**. Right now, the **ra** hasn't been changed, hence it is still 0x00000000. ![](https://hackmd.io/_uploads/ByyI3l1W6.png) ![](https://hackmd.io/_uploads/r1SphgkbT.png) ![IF stage](https://hackmd.io/_uploads/SJ0hjekba.png) ### ID stage In the **ID** stage, the processor decodes the fetched instruction to determine its opcode and operands. The **Imm** is 24, it is because the **generateBitmask** is at the 0x00000034. (34-10=24) ![](https://hackmd.io/_uploads/HkHXw-kWp.png) ![](https://hackmd.io/_uploads/r1rmQ-yb6.png) ### EX stage In the **EX** stage, the processor performs the arithmetic and logic operations, memory address calculations, and more. In this **jal** instruction, **ALU** compute the 0x0000000010 + 0x00000024 = 0x00000034, which is the branch address, hence 0x00000034 will be passed to the PC for fetching the next correct instruction. ![](https://hackmd.io/_uploads/rJBP_ZJ-p.png) ![](https://hackmd.io/_uploads/ryNi3b1-T.png) ### MEM stage In the **MEM** stage, the processor performs reading data from memory, writing data to memory, or performing other memory-related operations. In this case, it will still read the memory **0x00000034** and read out **0xffc10113**, but it will do nothing. In addition to, **0x00000014** (return address) passes through pipeline and the **0x01** represent the write register index i.e. **ra** ![](https://hackmd.io/_uploads/HJmPRby-T.png) ![](https://hackmd.io/_uploads/Hy8w1zkZp.png) ### WB stage In the **WB** stage, the processor writes the results obtained from the **EXE** stage back to the register file. The contorl signal will let **0x00000014** pass through the **MUX** and write it back to the **ra** register. ![](https://hackmd.io/_uploads/rkOt7M1ZT.png) ![](https://hackmd.io/_uploads/ry8b7f1Wa.png) After this stage, the **ra** register becomes **0x00000014** ![](https://hackmd.io/_uploads/Bylt4fy-6.png) ## In the generateBitmask function ![](https://hackmd.io/_uploads/BJuwrfJW6.png) **sp** will minus 4,from **0x7ffffff0** to **0x7fffffec**. ![](https://hackmd.io/_uploads/BJP1UMy-6.png) The memory **0x7fffffec** will store **0x00000014** which is the next instruction after the execute of **generateBitmask** function ![](https://hackmd.io/_uploads/BknxLfkWT.png) ![](https://hackmd.io/_uploads/HyXWDfJba.png) The **ra** will become **0x00000040** which is the next instruction after the finish of clz function. ![](https://hackmd.io/_uploads/S1mwdG1-T.png) ![](https://hackmd.io/_uploads/B1ihuz1bp.png) After finishing the clz function, the program will jump back to **generateBitmask** function by **ra** to return value, hence before we finish the **generateBitmask** function we need to load the correct return address from the stack and add the **sp** back. ![](https://hackmd.io/_uploads/BJk19MkZa.png) ## Excutable code by Ripes ``` 00000000 <main>: 0: 10000617 auipc x12 0x10000 4: 00060613 addi x12 x12 0 8: 00300e13 addi x28 x0 3 0000000c <loop>: c: 00062503 lw x10 0 x12 10: 024000ef jal x1 36 <generateBitmask> 14: 00050593 addi x11 x10 0 18: 00062503 lw x10 0 x12 1c: 0d0000ef jal x1 208 <printResult> 20: 00460613 addi x12 x12 4 24: fffe0e13 addi x28 x28 -1 28: fe0e12e3 bne x28 x0 -28 <loop> 2c: 00a00893 addi x17 x0 10 30: 00000073 ecall 00000034 <generateBitmask>: 34: ffc10113 addi x2 x2 -4 38: 00112023 sw x1 0 x2 3c: 024000ef jal x1 36 <clz> 40: 02000293 addi x5 x0 32 44: 40a282b3 sub x5 x5 x10 48: 00100313 addi x6 x0 1 4c: 00531533 sll x10 x6 x5 50: fff50513 addi x10 x10 -1 54: 00012083 lw x1 0 x2 58: 00410113 addi x2 x2 4 5c: 00008067 jalr x0 x1 0 00000060 <clz>: 60: 00155293 srli x5 x10 1 64: 00a2e533 or x10 x5 x10 68: 00255293 srli x5 x10 2 6c: 00a2e533 or x10 x5 x10 70: 00455293 srli x5 x10 4 74: 00a2e533 or x10 x5 x10 78: 00855293 srli x5 x10 8 7c: 00a2e533 or x10 x5 x10 80: 01055293 srli x5 x10 16 84: 00a2e533 or x10 x5 x10 88: 00155293 srli x5 x10 1 8c: 55555337 lui x6 0x55555 90: 55530313 addi x6 x6 1365 94: 0062f2b3 and x5 x5 x6 98: 40550533 sub x10 x10 x5 9c: 00255293 srli x5 x10 2 a0: 33333337 lui x6 0x33333 a4: 33330313 addi x6 x6 819 a8: 0062f2b3 and x5 x5 x6 ac: 006573b3 and x7 x10 x6 b0: 00728533 add x10 x5 x7 b4: 00455293 srli x5 x10 4 b8: 00a282b3 add x5 x5 x10 bc: 0f0f1337 lui x6 0xf0f1 c0: f0f30313 addi x6 x6 -241 c4: 0062f533 and x10 x5 x6 c8: 00855293 srli x5 x10 8 cc: 00550533 add x10 x10 x5 d0: 01055293 srli x5 x10 16 d4: 00550533 add x10 x10 x5 d8: 03f00293 addi x5 x0 63 dc: 00557533 and x10 x10 x5 e0: 02000293 addi x5 x0 32 e4: 40a28533 sub x10 x5 x10 e8: 00008067 jalr x0 x1 0 000000ec <printResult>: ec: 00050293 addi x5 x10 0 f0: 00058313 addi x6 x11 0 f4: 10000517 auipc x10 0x10000 f8: f1850513 addi x10 x10 -232 fc: 00400893 addi x17 x0 4 100: 00000073 ecall 104: 00028513 addi x10 x5 0 108: 00100893 addi x17 x0 1 10c: 00000073 ecall 110: 10000517 auipc x10 0x10000 114: f0b50513 addi x10 x10 -245 118: 00400893 addi x17 x0 4 11c: 00000073 ecall 120: 00030513 addi x10 x6 0 124: 00100893 addi x17 x0 1 128: 00000073 ecall 12c: 10000517 auipc x10 0x10000 130: ef450513 addi x10 x10 -268 134: 00400893 addi x17 x0 4 138: 00000073 ecall 13c: 00008067 jalr x0 x1 0 ``` ## Optimize When I am writing this HackMD report, I found that my code actually has some redundant part, so I tried to reduce the code size and minimize runtime overhead. **I notice that every time I run my program, the clock rate will be different in the same code.** But the Cycles, Instrs. retired, CPI, IPC are always the same, hence I will focus on the compare of Cycles and CPI, especially in the condition that the clock rate is almost the same. Full view of all version code [github](https://github.com/yuchen0620/Computer-Architecture/tree/main/HW1) The code above is version_1 ![](https://hackmd.io/_uploads/H1BIu8lZT.png) ![](https://hackmd.io/_uploads/rk4Fafy-p.png) In the version_2, I change the **printResult** into **loop**, because in the version_1, when the program finishing the **generateBitmask**, it will jump to **loop** to do only 2 instruction **lw t0, 0(a2)** (load input value into t0) **mv t1, a0** (move bitmask to t1) and then jump to **printResult**, after finishing the **printResult**, the program will jump back to **loop** again, hence I think that it is a little waste of time. ![](https://hackmd.io/_uploads/r18wSux-6.png) **Cycles from 306 to 282 CPI from 1.33 to 1.29** ![](https://hackmd.io/_uploads/S16bu8eZa.png) **The text size from 320 to 304** ![](https://hackmd.io/_uploads/H1yZcelZ6.png) In the version_3, I change the **generateBitmask** function, the original one used the stack to store the return address temporary, hence it needed 5 instructions. **addi sp, sp, -4 sw ra, 0(sp) ... lw ra,0(sp) addi sp, sp,4 jr ra** I change it into 2 instruction by using register s0 **mv s0, ra jalr x0, s0, 0** **Cycles from 282 to 273** **CPI from 1.29 to 1.31** ![](https://hackmd.io/_uploads/HkhcuIxW6.png) **The text size from 304 to 292** ![](https://hackmd.io/_uploads/Sktj_8gZa.png) In the version_4, I tried the version without loop. Besides, I move the printResult into the **generateBitmask**, because I don't want to have three the same section to do **printResult** in main function. ```c main: la a2, test lw a0, 0(a2) jal ra, generateBitmask lw a0, 4(a2) jal ra, generateBitmask lw a0, 8(a2) jal ra, generateBitmask # Exit program li a7, 10 ecall generateBitmask: mv s0, ra mv a1, a0 #store input value jal ra, clz # return (1 << (32 - leading_zeros)) - 1; li t0, 32 sub t0, t0, a0 li t1, 1 sll a0, t1, t0 addi a0, a0, -1 # Print the result to console # a1: input value # t1: bitmask mv t1, a0 la a0, str1 li a7, 4 ecall mv a0, a1 li a7, 35 ecall la a0, str2 li a7, 4 ecall mv a0, t1 li a7, 35 ecall la a0, str3 li a7, 4 ecall jalr x0, s0, 0 ``` **Cycles from 273 to 259** **CPI from 1.31 to 1.3** ![](https://hackmd.io/_uploads/SJ6EADe-p.png) **The text size is the same (292).** ![](https://hackmd.io/_uploads/BJGLCvl-6.png) ## 64bit version (2023/11/1) I refer to [丁竟烽](https://hackmd.io/@Paintako/CA_HW02)'s HW2, he optimizes my 32bit version code. ```diff uint32_t generate_bitmask(uint32_t x) { uint16_t leading_zeros = count_leading_zeros(x); if (leading_zeros == 0) return 0xffffffff; - return (1 << (32 - leading_zeros)) - 1 ; + return 0xffffffff >> leading_zeros; } ``` But there is a small mistake in his code, if the input = 0, leading_zeros will be 32, and the program will return **0xffffffff >> 32 = 0xffffffff**. Because C will do **0xffffffff >> 32 (mod 32) = 0xffffffff >> 0 = 0xffffffff**. But the answer should be 0. Therefore we need to change our edge condition. ```diff uint32_t generate_bitmask(uint32_t x) { uint16_t leading_zeros = count_leading_zeros(x); - if (leading_zeros == 0) return 0xffffffff; + if (leading_zeros ==32) return 0; return 0xffffffff >> leading_zeros; } ``` Besides, I also implement the 64bit version of generate_bitmask! ```c #include <stdio.h> #include <stdint.h> uint16_t count_leading_zeros(uint64_t x) { x |= (x >> 1); x |= (x >> 2); x |= (x >> 4); x |= (x >> 8); x |= (x >> 16); x |= (x >> 32); /* count ones (population count) */ x -= ((x >> 1) & 0x5555555555555555 ); x = ((x >> 2) & 0x3333333333333333) + (x & 0x3333333333333333 ); x = ((x >> 4) + x) & 0x0f0f0f0f0f0f0f0f; x += (x >> 8); x += (x >> 16); x += (x >> 32); return (64 - (x & 0x7f)); } uint64_t generate_bitmask(uint64_t x) { uint16_t leading_zeros = count_leading_zeros(x); if (leading_zeros==64) return 0; return 0xffffffffffffffff >> leading_zeros; } int main() { uint64_t test_data[] = {0, 4, 0x8000000000000000}; for (int i = 0; i < 3; i++){ printf("The reslut of %llx is %llx\n", test_data[i], generate_bitmask(test_data[i])); } } ``` ## Assembly code - 64bit ```c .data test: .word 0,0,0,4,0x80000000,0 # data1=0, data2=4, data3=0x80000000000000000 str1: .string "The reslut of " str2: .string " is " str3: .string " " str4: .string "\n" .text main: la a2, test li t4, 3 loop: lw a1, 0(a2) lw a0, 4(a2) jal ra, generateBitmask # Print the result to console mv t0, a0 mv t1, a1 lw a1, 0(a2) lw a0, 4(a2) jal ra, printResult # Loop control addi a2, a2, 8 addi t4, t4, -1 bnez t4,loop # Exit program li a7, 10 ecall generateBitmask: addi sp, sp, -4 sw ra, 0(sp) jal ra, clz # return 0xffffffffffffffff >> leading_zeros; li t0, 32 bge a0, t0, leading_zeros_bigger_32 # leading_zeros_smaller_32 li t2,0xffffffff # high32bit srl a1, t2, a0 li a0, 0xffffffff # low32bit lw ra,0(sp) addi sp, sp,4 jr ra leading_zeros_bigger_32: addi a0, a0, -32 li t2, 0xffffffff # low32bit srl a0, t2, a0 li a1, 0 # high32bit lw ra,0(sp) addi sp, sp,4 jr ra clz: # x |= (x>>1) srli t1, a0, 1 slli t2, a1, 31 or t1, t1, t2 srli t2, a1, 1 or a0, a0, t1 or a1, a1, t2 # x |= (x>>2) srli t1, a0, 2 slli t2, a1, 30 or t1, t1, t2 srli t2, a1, 2 or a0, a0, t1 or a1, a1, t2 # x |= (x>>4) srli t1, a0, 4 slli t2, a1, 28 or t1, t1, t2 srli t2, a1, 4 or a0, a0, t1 or a1, a1, t2 # x |= (x>>8) srli t1, a0, 8 slli t2, a1, 24 or t1, t1, t2 srli t2, a1, 8 or a0, a0, t1 or a1, a1, t2 # x |= (x>>16) srli t1, a0, 16 slli t2, a1, 16 or t1, t1, t2 srli t2, a1, 16 or a0, a0, t1 or a1, a1, t2 # x |= (x>>32) or a0, a0, a1 # x -= ((x >> 1) & 0x5555555555555555) srli t0, a0, 1 slli t1, a1, 31 or t0, t0, t1 li t2, 0x55555555 and t0, t0 ,t2 srli t1, a1, 1 and t1, t1, t2 sub a0, a0, t0 sub a1, a1, t1 # x = ((x >> 2) & 0x3333333333333333) + (x & 0x3333333333333333) srli t0, a0, 2 slli t1, a1, 30 or t0, t0, t1 li t2, 0x33333333 and t0, t0, t2 srli t1, a1, 2 and t1, t1, t2 and a0, a0, t2 and a1, a1, t2 add a0, a0, t0 add a1, a1, t1 # x = ((x >> 4) + x) & 0x0f0f0f0f srli t0, a0, 4 slli t1, a1, 28 or t0, t0, t1 srli t1, a1, 4 add a0, a0, t0 add a1, a1, t1 li t0, 0x0f0f0f0f and a0, a0, t0 and a1, a1, t0 # x += (x >> 8) srli t0, a0, 8 slli t1, a1, 24 or t0, t0, t1 srli t1, a1, 8 add a0, a0, t0 add a1, a1, t1 # x += (x >> 16) srli t0, a0, 16 slli t1, a1, 16 or t0, t0, t1 srli t1, a1, 16 add a0, a0, t0 add a1, a1, t1 # x += (x >> 32) add a0, a0, a1 # return (64 - (x & 0x7f)) li t0, 0x7f and a0, a0, t0 li t0, 64 sub a0, t0, a0 ret # t0: low 32bit of bitmask # t1: high 32bit of bitmask # a0: low 32bit of input # a1: high 32bit of input printResult: mv t2, a0 mv t3, a1 la a0, str1 li a7, 4 ecall mv a0, t3 li a7, 35 ecall la a0, str3 li a7, 4 ecall mv a0, t2 li a7, 35 ecall la a0, str2 li a7, 4 ecall mv a0, t1 li a7, 35 ecall la a0, str3 li a7, 4 ecall mv a0, t0 li a7, 35 ecall la a0, str4 li a7, 4 ecall ret ``` Output ``` The reslut of 0b00000000000000000000000000000000 0b00000000000000000000000000000000 is 0b00000000000000000000000000000000 0b00000000000000000000000000000000 The reslut of 0b00000000000000000000000000000000 0b00000000000000000000000000000100 is 0b00000000000000000000000000000000 0b00000000000000000000000000000111 The reslut of 0b10000000000000000000000000000000 0b00000000000000000000000000000000 is 0b11111111111111111111111111111111 0b11111111111111111111111111111111 ``` ![](https://hackmd.io/_uploads/BJRXa0RG6.png) ![](https://hackmd.io/_uploads/B1C4600Ga.png)