Try   HackMD

Implenet priority encoder using CLZ

contributed by ollieni

Introduction

A priority encoder is a circuit or algorithm used to efficiently reduce multiple binary inputs into a smaller set of outputs.

If two or more inputs are simultaneously applied to the priority encoder, the input with the highest priorit will be output first.

Here is the input and the corresponding output of 4-to-2 priority encoder.

Image Not Showing Possible Reasons
  • The image was uploaded to a note which you don't have access to
  • The note which the image was originally uploaded to has been deleted
Learn More →

Motivation

The priority encoder can be implemented by using some logic gate.

For example:
A 4-to-2 priority encoder can be simply implemented using two logic gate.
A1 = Y3 + Y2
A0 = Y3 + Y1

Image Not Showing Possible Reasons
  • The image was uploaded to a note which you don't have access to
  • The note which the image was originally uploaded to has been deleted
Learn More →

But when input increase, it become overly complicated.
Futhermore, general priority encoder can't handle various input size.

So I decide to adopt CLZ(counting leading zeros) method to address these two problem.

Method

Since CLZ can efficiently help us to identify the first non-zero bits, which corresponds to the highest-priority input in priority encoder.
We can use the size of input to substract leading zeros to obtain the position of the first active (high) bit in a binary number.

For example, we have a four-bits input "0010", using CLZ will return 2. Then we use input_size 4 to substract 2, note that we still need to additionally subtract 1 since the priority is start at 0.
Then we will get 1, which is "01" in binary, as the priority of "0010".

Using this method, the two problems that I mentioned in motivation could be addressed.

Implemention

  • C code
#include <stdio.h> #include <stdint.h> uint16_t count_leading_zeros(uint64_t x, int input_size) { x |= (x >> 1); x |= (x >> 2); x |= (x >> 4); x |= (x >> 8); x |= (x >> 16); x |= (x >> 32); /* count ones (population count) */ x -= ((x >> 1) & 0x5555555555555555 ); x = ((x >> 2) & 0x3333333333333333) + (x & 0x3333333333333333 ); x = ((x >> 4) + x) & 0x0f0f0f0f0f0f0f0f; x += (x >> 8); x += (x >> 16); x += (x >> 32); return (input_size - (x & 0x7f)); } int priority_encoder_4bit(int leading_zero, int input_size) { if(leading_zero<0){ printf("wrong input"); return 0; } else{ return (input_size -1 -(leading_zero)); } } int main() { int input; int input_size; int leading_zero = 0; int priority = 0; printf("Enter a number : "); scanf("%4d", &input); printf("Enter a input size : "); scanf("%4d", &input_size); leading_zero = count_leading_zeros(input,input_size); printf("%d\n",leading_zero); priority = priority_encoder_4bit(leading_zero, input_size); printf("%d\n",priority); return 0; }
  • RISC-V code
.data: nextline: .string "\n" input_size: .word 4 data_1: .word 0b00000000000000000000000000000001 data_2: .word 0b00000000000000000000000000000011 data_3: .word 0b00000000000000000000000000000100 data_4: .word 0b00000000000000000000000000001001 main: addi sp, sp, -20 # push four pointers of test data onto the stack lw t0, data_1 sw t0, 0(sp) lw t0, data_2 sw t0, 4(sp) lw t0, data_3 sw t0, 8(sp) lw t0, data_4 sw t0, 12(sp) addi s0, x0, 4 # s0 is the iteration times(4 test case) addi s1, x0, 0 # s1 is counter addi s2, sp, 0 # s2 initial at data_1 loop: lw a0, 0(s2) #load data into a0 addi s2, s2, 4 # s2 move to next data addi s1, s1, 1 # counter++ clz: # Load the constants A01, A02, and mask 0x0f0f0f0f lui a5, 0x55555 ori a5, a5, 0x555 lui a6, 0x33333 ori a6, a6, 0x333 lui a7, 0x0f0f0 ori a7, a7, 0x788 addi a7, a7, 0x787 # OR the input value with itself shifted to the right by 1, 2, 4, 8, 16, and 16 bits (total 32). srli a1, a0, 1 or a0, a0, a1 srli a1, a0, 2 or a0, a0, a1 srli a1, a0, 4 or a0, a0, a1 srli a1, a0, 8 or a0, a0, a1 srli a1, a0, 16 or a0, a0, a1 srli a1, a0, 16 srli a2, a1, 16 or a0, a0, a2 # Count ones (population count). srli a1, a0, 1 and a4, a1, a5 sub a0, a0, a4 and a1, a0, a6 srli a2, a0, 2 and a3, a6, a2 add a0, a3, a1 srli a1, a0, 4 add a0, a1, a0 and a0, a0, a7 srli a1, a0, 8 add a0,a1,a0 srli a1, a0, 16 add a0, a1, a0 srli a1, a0, 16 srli a2, a1, 16 add a0, a0, a2 # Return the number of zeros. andi a0, a0, 0x7f # andi a0, a0, 0x7f addi a1, x0, input_size sub a0, a1, a0 #count the priority addi a1, a1, -1 sub a0, a1, a0 print: li a7, 35 ecall lw a0, nextline li, a7, 11 ecall bne s1, s0, loop beq s1, s0, end end: li a7, 10 # Exit system call ecall

Analysis

We test our code with Ripes simulator.
Here are the first few lines of the executable code:

00000016 <main>: 16: fec10113 addi x2 x2 -20 1a: 00000297 auipc x5 0x0 <nextline> 1e: fec2a283 lw x5 -20 x5 22: 00512023 sw x5 0 x2 26: 00000297 auipc x5 0x0 <nextline> 2a: fe42a283 lw x5 -28 x5 2e: 00512223 sw x5 4 x2 32: 00000297 auipc x5 0x0 <nextline> 36: fdc2a283 lw x5 -36 x5 3a: 00512423 sw x5 8 x2 3e: 00000297 auipc x5 0x0 <nextline> 42: fd42a283 lw x5 -44 x5 46: 00512623 sw x5 12 x2 4a: 00400413 addi x8 x0 4 4e: 00000493 addi x9 x0 0 52: 00010913 addi x18 x2 0 00000056 <loop>: 56: 00092503 lw x10 0 x18 5a: 00490913 addi x18 x18 4 5e: 00148493 addi x9 x9 1 00000062 <clz>: 62: 555557b7 lui x15 0x55555 66: 5557e793 ori x15 x15 1365 6a: 33333837 lui x16 0x33333 6e: 33386813 ori x16 x16 819 72: 0f0f08b7 lui x17 0xf0f0 76: 7888e893 ori x17 x17 1928 7a: 78788893 addi x17 x17 1927 7e: 00155593 srli x11 x10 1 82: 00b56533 or x10 x10 x11 86: 00255593 srli x11 x10 2 8a: 00b56533 or x10 x10 x11

5-stage pipelined processor

We choose 5-stage processor to run our code.
Below is the processor diagram.

Next I would analyze how the instruction go through

  1. Instruction fetch (IF)
  2. Instruction decode and register fetch (ID)
  3. Execute (EX)
  4. Memory access (MEM)
  5. Register write back (WB)

I would take addi x2, x2, -20 as instance:

IF stage

  • The address is now at 0x00000016 since the instruction is put at 0x00000016.
  • PC will increment by 4 since no branch occur, so PC will become 0x0000001a.
  • We can know the machine of the instruction is 0xfec10113 from instr. memory.

ID stage

  • The instruction is decoded from0xfec10113 to ADDI.
  • And we can see the result will be store at 0x02, which is x2.

EX stage

  • ALU recieve a value 0xffffffec, which is -20 in the instruction.
  • And 0x02(x2) and 0xffffffec is passed to the next stage to execute.

MEM stage

  • The data in is 0x00000000 and register 0x02 is passed to the final stage.

WB stage

  • The register 0x02 is writed and pushed out, then the instruction is done.

Enhancement

Instruction refinement
First, I found that I mistakenly put the initialization of masks in loop.

Then I put it out of loop to avoid repeating initialization.
Before

After

The cycle and instuction count has decreased after the optimization.


Then I continued reducing some redundant instructions.
Before

After

The result imply someone might not wrote the code properly at first, so code review is really important.