contributed by ollieni
A priority encoder is a circuit or algorithm used to efficiently reduce multiple binary inputs into a smaller set of outputs.
If two or more inputs are simultaneously applied to the priority encoder, the input with the highest priorit will be output first.
Here is the input and the corresponding output of 4-to-2 priority encoder.
The priority encoder can be implemented by using some logic gate.
For example:
A 4-to-2 priority encoder can be simply implemented using two logic gate.
A1 = Y3 + Y2
A0 = Y3 + Y1
But when input increase, it become overly complicated.
Futhermore, general priority encoder can't handle various input size.
So I decide to adopt CLZ(counting leading zeros) method to address these two problem.
Since CLZ can efficiently help us to identify the first non-zero bits, which corresponds to the highest-priority input in priority encoder.
We can use the size of input to substract leading zeros to obtain the position of the first active (high) bit in a binary number.
For example, we have a four-bits input "0010", using CLZ will return 2. Then we use input_size 4 to substract 2, note that we still need to additionally subtract 1 since the priority is start at 0.
Then we will get 1, which is "01" in binary, as the priority of "0010".
Using this method, the two problems that I mentioned in motivation could be addressed.
#include <stdio.h>
#include <stdint.h>
uint16_t count_leading_zeros(uint64_t x, int input_size)
{
x |= (x >> 1);
x |= (x >> 2);
x |= (x >> 4);
x |= (x >> 8);
x |= (x >> 16);
x |= (x >> 32);
/* count ones (population count) */
x -= ((x >> 1) & 0x5555555555555555 );
x = ((x >> 2) & 0x3333333333333333) + (x & 0x3333333333333333 );
x = ((x >> 4) + x) & 0x0f0f0f0f0f0f0f0f;
x += (x >> 8);
x += (x >> 16);
x += (x >> 32);
return (input_size - (x & 0x7f));
}
int priority_encoder_4bit(int leading_zero, int input_size) {
if(leading_zero<0){
printf("wrong input");
return 0;
}
else{
return (input_size -1 -(leading_zero));
}
}
int main() {
int input;
int input_size;
int leading_zero = 0;
int priority = 0;
printf("Enter a number : ");
scanf("%4d", &input);
printf("Enter a input size : ");
scanf("%4d", &input_size);
leading_zero = count_leading_zeros(input,input_size);
printf("%d\n",leading_zero);
priority = priority_encoder_4bit(leading_zero, input_size);
printf("%d\n",priority);
return 0;
}
.data:
nextline: .string "\n"
input_size: .word 4
data_1: .word 0b00000000000000000000000000000001
data_2: .word 0b00000000000000000000000000000011
data_3: .word 0b00000000000000000000000000000100
data_4: .word 0b00000000000000000000000000001001
main:
addi sp, sp, -20
# push four pointers of test data onto the stack
lw t0, data_1
sw t0, 0(sp)
lw t0, data_2
sw t0, 4(sp)
lw t0, data_3
sw t0, 8(sp)
lw t0, data_4
sw t0, 12(sp)
addi s0, x0, 4 # s0 is the iteration times(4 test case)
addi s1, x0, 0 # s1 is counter
addi s2, sp, 0 # s2 initial at data_1
loop:
lw a0, 0(s2) #load data into a0
addi s2, s2, 4 # s2 move to next data
addi s1, s1, 1 # counter++
clz:
# Load the constants A01, A02, and mask 0x0f0f0f0f
lui a5, 0x55555
ori a5, a5, 0x555
lui a6, 0x33333
ori a6, a6, 0x333
lui a7, 0x0f0f0
ori a7, a7, 0x788
addi a7, a7, 0x787
# OR the input value with itself shifted to the right by 1, 2, 4, 8, 16, and 16 bits (total 32).
srli a1, a0, 1
or a0, a0, a1
srli a1, a0, 2
or a0, a0, a1
srli a1, a0, 4
or a0, a0, a1
srli a1, a0, 8
or a0, a0, a1
srli a1, a0, 16
or a0, a0, a1
srli a1, a0, 16
srli a2, a1, 16
or a0, a0, a2
# Count ones (population count).
srli a1, a0, 1
and a4, a1, a5
sub a0, a0, a4
and a1, a0, a6
srli a2, a0, 2
and a3, a6, a2
add a0, a3, a1
srli a1, a0, 4
add a0, a1, a0
and a0, a0, a7
srli a1, a0, 8
add a0,a1,a0
srli a1, a0, 16
add a0, a1, a0
srli a1, a0, 16
srli a2, a1, 16
add a0, a0, a2
# Return the number of zeros.
andi a0, a0, 0x7f # andi a0, a0, 0x7f
addi a1, x0, input_size
sub a0, a1, a0
#count the priority
addi a1, a1, -1
sub a0, a1, a0
print:
li a7, 35
ecall
lw a0, nextline
li, a7, 11
ecall
bne s1, s0, loop
beq s1, s0, end
end:
li a7, 10 # Exit system call
ecall
We test our code with Ripes simulator.
Here are the first few lines of the executable code:
00000016 <main>:
16: fec10113 addi x2 x2 -20
1a: 00000297 auipc x5 0x0 <nextline>
1e: fec2a283 lw x5 -20 x5
22: 00512023 sw x5 0 x2
26: 00000297 auipc x5 0x0 <nextline>
2a: fe42a283 lw x5 -28 x5
2e: 00512223 sw x5 4 x2
32: 00000297 auipc x5 0x0 <nextline>
36: fdc2a283 lw x5 -36 x5
3a: 00512423 sw x5 8 x2
3e: 00000297 auipc x5 0x0 <nextline>
42: fd42a283 lw x5 -44 x5
46: 00512623 sw x5 12 x2
4a: 00400413 addi x8 x0 4
4e: 00000493 addi x9 x0 0
52: 00010913 addi x18 x2 0
00000056 <loop>:
56: 00092503 lw x10 0 x18
5a: 00490913 addi x18 x18 4
5e: 00148493 addi x9 x9 1
00000062 <clz>:
62: 555557b7 lui x15 0x55555
66: 5557e793 ori x15 x15 1365
6a: 33333837 lui x16 0x33333
6e: 33386813 ori x16 x16 819
72: 0f0f08b7 lui x17 0xf0f0
76: 7888e893 ori x17 x17 1928
7a: 78788893 addi x17 x17 1927
7e: 00155593 srli x11 x10 1
82: 00b56533 or x10 x10 x11
86: 00255593 srli x11 x10 2
8a: 00b56533 or x10 x10 x11
We choose 5-stage processor to run our code.
Below is the processor diagram.
Next I would analyze how the instruction go through
I would take addi x2, x2, -20
as instance:
IF stage
0x00000016
since the instruction is put at 0x00000016
.0x0000001a
.0xfec10113
from instr. memory.ID stage
0xfec10113
to ADDI.0x02
, which is x2.EX stage
0xffffffec
, which is -20 in the instruction.0x02
(x2) and 0xffffffec
is passed to the next stage to execute.MEM stage
0x00000000
and register 0x02
is passed to the final stage.WB stage
0x02
is writed and pushed out, then the instruction is done.Instruction refinement
First, I found that I mistakenly put the initialization of masks in loop.
Then I put it out of loop to avoid repeating initialization.
Before
After
The cycle and instuction count has decreased after the optimization.
Then I continued reducing some redundant instructions.
Before
After
The result imply someone might not wrote the code properly at first, so code review is really important.