# Generate bitmask by CLZ
contributed by <[yuchen0620](https://github.com/yuchen0620)>
[Bitmask](https://en.wikipedia.org/wiki/Mask_(computing)) is common used in computer science field. Using a mask, multiple bits in a byte, [nibble](https://en.wikipedia.org/wiki/Nibble), word, etc. can be set either on or off, or inverted from on to off (or vice versa) in a single bitwise operation. Hence, I use Count Leading Zeros(CLZ) to generate bitmask!With CLZ, we can generate bitmask which has the same bit num with the input, or we can do the operation to mask on only half bits of the input.
Example: Masking on the higher nibble (bits 4, 5, 6, 7) while leaving the lower nibble (bits 0, 1, 2, 3) unchanged.
```
10010101 10100101
OR 11110000 11110000
= 11110101 11110101
```
## C code
This is 32bit version, the function **generate_bitmask** will call the function **count_leading_zeros**, and return the bitmask which is full of 1 and has the same bit with the input data.
```c=1
#include <stdio.h>
#include <stdint.h>
uint16_t count_leading_zeros(uint32_t x)
{
x |= (x >> 1);
x |= (x >> 2);
x |= (x >> 4);
x |= (x >> 8);
x |= (x >> 16);
/* count ones (population count) */
x -= ((x >> 1) & 0x55555555 );
x = ((x >> 2) & 0x33333333) + (x & 0x33333333 );
x = ((x >> 4) + x) & 0x0f0f0f0f;
x += (x >> 8);
x += (x >> 16);
return (32 - (x & 0x3f));
}
uint32_t generate_bitmask(uint32_t x)
{
uint16_t leading_zeros = count_leading_zeros(x);
if (leading_zeros==0) return 0xffffffff;
return (1 << (32 - leading_zeros)) - 1 ;
}
int main()
{
uint32_t test_data[] = {0, 4, 0x80000000};
for (int i = 0; i < 3; i++){
printf("The reslut of %x is %x\n", test_data[i],generate_bitmask(test_data[i]));
}
}
```

We can generate the different bitmask by only changing the return statement in the **generate_bitmask**!
Example: Generate the bitmask that set the left half of input into 0
```c
uint32_t generate_bitmask(uint32_t x)
{
uint16_t leading_zeros = count_leading_zeros(x);
if (leading_zeros==0) return 0xffffffff;
return (1 << ((32 - leading_zeros)/2)) - 1;
}
```
If input x = **0b10110011**, then **leading_zeros** = 24,
we will get the return (1 << (32 - 24) / 2) - 1 = **0b1111**, and use **and** operation to set the left half bit to 0.
```
10110011
and 00001111
00000011
```
## **uint32_t problem**
When the leading_zero is 0 (i.e. input is bigger than 0x80000000), the return becomes (1 << 32 ) - 1, equal to 0 - 1 = 0, because uint32_t doesn't have negative numbers, hence we need a condition statement **if (leading_zeros==0) return 0xffffffff** to handle it, but in the assembly code 1<<32 - 1 will return 0xffffffff which is what we want, therefore we can skip the condition statement in the assembly code.
```c
uint32_t generate_bitmask(uint32_t x)
{
uint16_t leading_zeros = count_leading_zeros(x);
if (leading_zeros==0) return 0xffffffff;
return (1 << (32 - leading_zeros)) - 1 ;
}
```
## Assembly code
My program starts from **main**, next will enter **loop** and jump to **generateBitmask**, return address will store in **ra** register, the first thing we jump to **generateBitmask** is to store ra in the stack, because **generateBitmask** will call **clz** and **ra** needs to store the address that jump back to **generateBitmask** when finish **clz**, **clz** will store result in **a0**, after back to **generateMask**, **generateMask** will compute the result and load the return address from stack to go back to **loop**, **loop** will load the input again for **printResult** and jump to **printResult**,
after printing the result, the program will back to **loop** to do loop control, checking the test finish or not, if not it will continue the procedure above until finish all the test data.
```c
.data
test: .word 0,4,0x80000000
str1: .string "The reslut of "
str2: .string " is "
str3: .string "\n"
.text
main:
la a2, test
li t3, 3
loop:
lw a0, 0(a2)
jal ra, generateBitmask
# Print the result to console
mv a1, a0
lw a0, 0(a2)
jal ra, printResult
# Loop control
addi a2, a2, 4
addi t3, t3, -1
bnez t3,loop
# Exit program
li a7, 10
ecall
generateBitmask:
addi sp, sp, -4
sw ra, 0(sp)
jal ra, clz
# return (1 << (32 - leading_zeros)) - 1;
li t0, 32
sub t0, t0, a0
li t1, 1
sll a0, t1, t0
addi a0, a0, -1
lw ra,0(sp)
addi sp, sp,4
jr ra
clz:
# x |= (x>>1)
srli t0, a0, 1
or a0, t0, a0
# x |= (x>>2)
srli t0, a0, 2
or a0, t0, a0
# x |= (x>>4)
srli t0, a0, 4
or a0, t0, a0
# x |= (x>>8)
srli t0, a0, 8
or a0, t0, a0
# x |= (x>>16)
srli t0, a0, 16
or a0, t0, a0
# x -= ((x >> 1) & 0x55555555)
srli t0, a0, 1
li t1, 0x55555555
and t0, t0, t1
sub a0, a0, t0
# x = ((x >> 2) & 0x33333333) + (x & 0x33333333)
srli t0, a0, 2
li t1, 0x33333333
and t0, t0, t1
and t2, a0, t1
add a0, t0, t2
# x = ((x >> 4) + x) & 0x0f0f0f0f
srli t0, a0, 4
add t0, t0, a0
li t1, 0x0f0f0f0f
and a0, t0, t1
# x += (x >> 8);
srli t0, a0, 8
add a0, a0, t0
# x += (x >> 16);
srli t0, a0, 16
add a0, a0, t0
# return (32 - (x & 0x3f))
li t0, 0x3f
and a0, a0, t0
li t0, 32
sub a0, t0, a0
ret
# a0: input value
# a1: bitmask
printResult:
mv t0, a0
mv t1, a1
la a0, str1
li a7, 4
ecall
mv a0, t0
li a7, 35
ecall
la a0, str2
li a7, 4
ecall
mv a0, t1
li a7, 35
ecall
la a0, str3
li a7, 4
ecall
ret
```
## Console

## Pipeline stage
When I were writing my code, the bug I face the most is about return address, so I will focus on the **jal** instruction in the pipeline stage!
### IF stage
In the **IF** stage, the processor fetch instruction to be executed from memory, besides, it will send the address from the program counter (PC) to memory to get the next instruction's content.
In this case, the instruction
**jal ra, generateBitmask** is at the 0x00000010 in the memory, the **PC** will add 4 to get next instruction. Besides, the **x1** register stores **ra**.
Right now, the **ra** hasn't been changed, hence it is still 0x00000000.



### ID stage
In the **ID** stage, the processor decodes the fetched instruction to determine its opcode and operands.
The **Imm** is 24, it is because the **generateBitmask** is at the 0x00000034.
(34-10=24)


### EX stage
In the **EX** stage, the processor performs the arithmetic and logic operations, memory address calculations, and more.
In this **jal** instruction, **ALU** compute the 0x0000000010 + 0x00000024 = 0x00000034, which is the branch address, hence 0x00000034 will be passed to the PC for fetching the next correct instruction.


### MEM stage
In the **MEM** stage, the processor performs reading data from memory, writing data to memory, or performing other memory-related operations.
In this case, it will still read the memory **0x00000034** and read out **0xffc10113**, but it will do nothing. In addition to, **0x00000014** (return address) passes through pipeline and the **0x01** represent the write register index i.e. **ra**


### WB stage
In the **WB** stage, the processor writes the results obtained from the **EXE** stage back to the register file.
The contorl signal will let **0x00000014** pass through the **MUX** and write it back to the **ra** register.


After this stage, the **ra** register becomes **0x00000014**

## In the generateBitmask function

**sp** will minus 4,from **0x7ffffff0** to **0x7fffffec**.

The memory **0x7fffffec** will store **0x00000014** which is the next instruction after the execute of **generateBitmask** function


The **ra** will become **0x00000040** which is the next instruction after the finish of clz function.


After finishing the clz function, the program will jump back to **generateBitmask** function by **ra** to return value, hence before we finish the **generateBitmask** function we need to load the correct return address from the stack and add the **sp** back.

## Excutable code by Ripes
```
00000000 <main>:
0: 10000617 auipc x12 0x10000
4: 00060613 addi x12 x12 0
8: 00300e13 addi x28 x0 3
0000000c <loop>:
c: 00062503 lw x10 0 x12
10: 024000ef jal x1 36 <generateBitmask>
14: 00050593 addi x11 x10 0
18: 00062503 lw x10 0 x12
1c: 0d0000ef jal x1 208 <printResult>
20: 00460613 addi x12 x12 4
24: fffe0e13 addi x28 x28 -1
28: fe0e12e3 bne x28 x0 -28 <loop>
2c: 00a00893 addi x17 x0 10
30: 00000073 ecall
00000034 <generateBitmask>:
34: ffc10113 addi x2 x2 -4
38: 00112023 sw x1 0 x2
3c: 024000ef jal x1 36 <clz>
40: 02000293 addi x5 x0 32
44: 40a282b3 sub x5 x5 x10
48: 00100313 addi x6 x0 1
4c: 00531533 sll x10 x6 x5
50: fff50513 addi x10 x10 -1
54: 00012083 lw x1 0 x2
58: 00410113 addi x2 x2 4
5c: 00008067 jalr x0 x1 0
00000060 <clz>:
60: 00155293 srli x5 x10 1
64: 00a2e533 or x10 x5 x10
68: 00255293 srli x5 x10 2
6c: 00a2e533 or x10 x5 x10
70: 00455293 srli x5 x10 4
74: 00a2e533 or x10 x5 x10
78: 00855293 srli x5 x10 8
7c: 00a2e533 or x10 x5 x10
80: 01055293 srli x5 x10 16
84: 00a2e533 or x10 x5 x10
88: 00155293 srli x5 x10 1
8c: 55555337 lui x6 0x55555
90: 55530313 addi x6 x6 1365
94: 0062f2b3 and x5 x5 x6
98: 40550533 sub x10 x10 x5
9c: 00255293 srli x5 x10 2
a0: 33333337 lui x6 0x33333
a4: 33330313 addi x6 x6 819
a8: 0062f2b3 and x5 x5 x6
ac: 006573b3 and x7 x10 x6
b0: 00728533 add x10 x5 x7
b4: 00455293 srli x5 x10 4
b8: 00a282b3 add x5 x5 x10
bc: 0f0f1337 lui x6 0xf0f1
c0: f0f30313 addi x6 x6 -241
c4: 0062f533 and x10 x5 x6
c8: 00855293 srli x5 x10 8
cc: 00550533 add x10 x10 x5
d0: 01055293 srli x5 x10 16
d4: 00550533 add x10 x10 x5
d8: 03f00293 addi x5 x0 63
dc: 00557533 and x10 x10 x5
e0: 02000293 addi x5 x0 32
e4: 40a28533 sub x10 x5 x10
e8: 00008067 jalr x0 x1 0
000000ec <printResult>:
ec: 00050293 addi x5 x10 0
f0: 00058313 addi x6 x11 0
f4: 10000517 auipc x10 0x10000
f8: f1850513 addi x10 x10 -232
fc: 00400893 addi x17 x0 4
100: 00000073 ecall
104: 00028513 addi x10 x5 0
108: 00100893 addi x17 x0 1
10c: 00000073 ecall
110: 10000517 auipc x10 0x10000
114: f0b50513 addi x10 x10 -245
118: 00400893 addi x17 x0 4
11c: 00000073 ecall
120: 00030513 addi x10 x6 0
124: 00100893 addi x17 x0 1
128: 00000073 ecall
12c: 10000517 auipc x10 0x10000
130: ef450513 addi x10 x10 -268
134: 00400893 addi x17 x0 4
138: 00000073 ecall
13c: 00008067 jalr x0 x1 0
```
## Optimize
When I am writing this HackMD report, I found that my code actually has some redundant part, so I tried to reduce the code size and minimize runtime overhead.
**I notice that every time I run my program, the clock rate will be different in the same code.**
But the Cycles, Instrs. retired, CPI, IPC are always the same, hence I will focus on the compare of Cycles and CPI, especially in the condition that the clock rate is almost the same.
Full view of all version code [github](https://github.com/yuchen0620/Computer-Architecture/tree/main/HW1)
The code above is version_1


In the version_2, I change the **printResult** into **loop**, because in the version_1, when the program finishing the **generateBitmask**, it will jump to **loop** to do only 2 instruction
**lw t0, 0(a2)** (load input value into t0)
**mv t1, a0** (move bitmask to t1)
and then jump to **printResult**, after finishing the **printResult**, the program will jump back to **loop** again, hence I think that it is a little waste of time.

**Cycles from 306 to 282
CPI from 1.33 to 1.29**

**The text size from 320 to 304**

In the version_3, I change the **generateBitmask** function, the original one used the stack to store the return address temporary, hence it needed 5 instructions.
**addi sp, sp, -4
sw ra, 0(sp)
...
lw ra,0(sp)
addi sp, sp,4
jr ra**
I change it into 2 instruction by using register s0
**mv s0, ra
jalr x0, s0, 0**
**Cycles from 282 to 273**
**CPI from 1.29 to 1.31**

**The text size from 304 to 292**

In the version_4, I tried the version without loop. Besides, I move the printResult into the **generateBitmask**, because I don't want to have three the same section to do **printResult** in main function.
```c
main:
la a2, test
lw a0, 0(a2)
jal ra, generateBitmask
lw a0, 4(a2)
jal ra, generateBitmask
lw a0, 8(a2)
jal ra, generateBitmask
# Exit program
li a7, 10
ecall
generateBitmask:
mv s0, ra
mv a1, a0 #store input value
jal ra, clz
# return (1 << (32 - leading_zeros)) - 1;
li t0, 32
sub t0, t0, a0
li t1, 1
sll a0, t1, t0
addi a0, a0, -1
# Print the result to console
# a1: input value
# t1: bitmask
mv t1, a0
la a0, str1
li a7, 4
ecall
mv a0, a1
li a7, 35
ecall
la a0, str2
li a7, 4
ecall
mv a0, t1
li a7, 35
ecall
la a0, str3
li a7, 4
ecall
jalr x0, s0, 0
```
**Cycles from 273 to 259**
**CPI from 1.31 to 1.3**

**The text size is the same (292).**

## 64bit version (2023/11/1)
I refer to [丁竟烽](https://hackmd.io/@Paintako/CA_HW02)'s HW2, he optimizes my 32bit version code.
```diff
uint32_t generate_bitmask(uint32_t x)
{
uint16_t leading_zeros = count_leading_zeros(x);
if (leading_zeros == 0) return 0xffffffff;
- return (1 << (32 - leading_zeros)) - 1 ;
+ return 0xffffffff >> leading_zeros;
}
```
But there is a small mistake in his code, if the input = 0, leading_zeros will be 32, and the program will return **0xffffffff >> 32 = 0xffffffff**. Because C will do **0xffffffff >> 32 (mod 32) = 0xffffffff >> 0 = 0xffffffff**. But the answer should be 0. Therefore we need to change our edge condition.
```diff
uint32_t generate_bitmask(uint32_t x)
{
uint16_t leading_zeros = count_leading_zeros(x);
- if (leading_zeros == 0) return 0xffffffff;
+ if (leading_zeros ==32) return 0;
return 0xffffffff >> leading_zeros;
}
```
Besides, I also implement the 64bit version of generate_bitmask!
```c
#include <stdio.h>
#include <stdint.h>
uint16_t count_leading_zeros(uint64_t x)
{
x |= (x >> 1);
x |= (x >> 2);
x |= (x >> 4);
x |= (x >> 8);
x |= (x >> 16);
x |= (x >> 32);
/* count ones (population count) */
x -= ((x >> 1) & 0x5555555555555555 );
x = ((x >> 2) & 0x3333333333333333) + (x & 0x3333333333333333 );
x = ((x >> 4) + x) & 0x0f0f0f0f0f0f0f0f;
x += (x >> 8);
x += (x >> 16);
x += (x >> 32);
return (64 - (x & 0x7f));
}
uint64_t generate_bitmask(uint64_t x)
{
uint16_t leading_zeros = count_leading_zeros(x);
if (leading_zeros==64) return 0;
return 0xffffffffffffffff >> leading_zeros;
}
int main()
{
uint64_t test_data[] = {0, 4, 0x8000000000000000};
for (int i = 0; i < 3; i++){
printf("The reslut of %llx is %llx\n", test_data[i], generate_bitmask(test_data[i]));
}
}
```
## Assembly code - 64bit
```c
.data
test: .word 0,0,0,4,0x80000000,0 # data1=0, data2=4, data3=0x80000000000000000
str1: .string "The reslut of "
str2: .string " is "
str3: .string " "
str4: .string "\n"
.text
main:
la a2, test
li t4, 3
loop:
lw a1, 0(a2)
lw a0, 4(a2)
jal ra, generateBitmask
# Print the result to console
mv t0, a0
mv t1, a1
lw a1, 0(a2)
lw a0, 4(a2)
jal ra, printResult
# Loop control
addi a2, a2, 8
addi t4, t4, -1
bnez t4,loop
# Exit program
li a7, 10
ecall
generateBitmask:
addi sp, sp, -4
sw ra, 0(sp)
jal ra, clz
# return 0xffffffffffffffff >> leading_zeros;
li t0, 32
bge a0, t0, leading_zeros_bigger_32
# leading_zeros_smaller_32
li t2,0xffffffff # high32bit
srl a1, t2, a0
li a0, 0xffffffff # low32bit
lw ra,0(sp)
addi sp, sp,4
jr ra
leading_zeros_bigger_32:
addi a0, a0, -32
li t2, 0xffffffff # low32bit
srl a0, t2, a0
li a1, 0 # high32bit
lw ra,0(sp)
addi sp, sp,4
jr ra
clz:
# x |= (x>>1)
srli t1, a0, 1
slli t2, a1, 31
or t1, t1, t2
srli t2, a1, 1
or a0, a0, t1
or a1, a1, t2
# x |= (x>>2)
srli t1, a0, 2
slli t2, a1, 30
or t1, t1, t2
srli t2, a1, 2
or a0, a0, t1
or a1, a1, t2
# x |= (x>>4)
srli t1, a0, 4
slli t2, a1, 28
or t1, t1, t2
srli t2, a1, 4
or a0, a0, t1
or a1, a1, t2
# x |= (x>>8)
srli t1, a0, 8
slli t2, a1, 24
or t1, t1, t2
srli t2, a1, 8
or a0, a0, t1
or a1, a1, t2
# x |= (x>>16)
srli t1, a0, 16
slli t2, a1, 16
or t1, t1, t2
srli t2, a1, 16
or a0, a0, t1
or a1, a1, t2
# x |= (x>>32)
or a0, a0, a1
# x -= ((x >> 1) & 0x5555555555555555)
srli t0, a0, 1
slli t1, a1, 31
or t0, t0, t1
li t2, 0x55555555
and t0, t0 ,t2
srli t1, a1, 1
and t1, t1, t2
sub a0, a0, t0
sub a1, a1, t1
# x = ((x >> 2) & 0x3333333333333333) + (x & 0x3333333333333333)
srli t0, a0, 2
slli t1, a1, 30
or t0, t0, t1
li t2, 0x33333333
and t0, t0, t2
srli t1, a1, 2
and t1, t1, t2
and a0, a0, t2
and a1, a1, t2
add a0, a0, t0
add a1, a1, t1
# x = ((x >> 4) + x) & 0x0f0f0f0f
srli t0, a0, 4
slli t1, a1, 28
or t0, t0, t1
srli t1, a1, 4
add a0, a0, t0
add a1, a1, t1
li t0, 0x0f0f0f0f
and a0, a0, t0
and a1, a1, t0
# x += (x >> 8)
srli t0, a0, 8
slli t1, a1, 24
or t0, t0, t1
srli t1, a1, 8
add a0, a0, t0
add a1, a1, t1
# x += (x >> 16)
srli t0, a0, 16
slli t1, a1, 16
or t0, t0, t1
srli t1, a1, 16
add a0, a0, t0
add a1, a1, t1
# x += (x >> 32)
add a0, a0, a1
# return (64 - (x & 0x7f))
li t0, 0x7f
and a0, a0, t0
li t0, 64
sub a0, t0, a0
ret
# t0: low 32bit of bitmask
# t1: high 32bit of bitmask
# a0: low 32bit of input
# a1: high 32bit of input
printResult:
mv t2, a0
mv t3, a1
la a0, str1
li a7, 4
ecall
mv a0, t3
li a7, 35
ecall
la a0, str3
li a7, 4
ecall
mv a0, t2
li a7, 35
ecall
la a0, str2
li a7, 4
ecall
mv a0, t1
li a7, 35
ecall
la a0, str3
li a7, 4
ecall
mv a0, t0
li a7, 35
ecall
la a0, str4
li a7, 4
ecall
ret
```
Output
```
The reslut of 0b00000000000000000000000000000000 0b00000000000000000000000000000000 is 0b00000000000000000000000000000000 0b00000000000000000000000000000000
The reslut of 0b00000000000000000000000000000000 0b00000000000000000000000000000100 is 0b00000000000000000000000000000000 0b00000000000000000000000000000111
The reslut of 0b10000000000000000000000000000000 0b00000000000000000000000000000000 is 0b11111111111111111111111111111111 0b11111111111111111111111111111111
```

