Implenet priority encoder using CLZ

contributed by ollieni

Introduction

A priority encoder is a circuit or algorithm used to efficiently reduce multiple binary inputs into a smaller set of outputs.

If two or more inputs are simultaneously applied to the priority encoder, the input with the highest priorit will be output first.

Here is the input and the corresponding output of 4-to-2 priority encoder.

Image Not Showing Possible Reasons

The image was uploaded to a note which you don't have access to
The note which the image was originally uploaded to has been deleted

Learn More →

Motivation

The priority encoder can be implemented by using some logic gate.

For example:
A 4-to-2 priority encoder can be simply implemented using two logic gate.
A1 = Y3 + Y2
A0 = Y3 + Y1

Image Not Showing Possible Reasons

The image was uploaded to a note which you don't have access to
The note which the image was originally uploaded to has been deleted

Learn More →

But when input increase, it become overly complicated.
Futhermore, general priority encoder can't handle various input size.

So I decide to adopt CLZ(counting leading zeros) method to address these two problem.

Method

Since CLZ can efficiently help us to identify the first non-zero bits, which corresponds to the highest-priority input in priority encoder.
We can use the size of input to substract leading zeros to obtain the position of the first active (high) bit in a binary number.

For example, we have a four-bits input "0010", using CLZ will return 2. Then we use input_size 4 to substract 2, note that we still need to additionally subtract 1 since the priority is start at 0.
Then we will get 1, which is "01" in binary, as the priority of "0010".

Using this method, the two problems that I mentioned in motivation could be addressed.

Implemention

C code
















































#include <stdio.h>
#include <stdint.h>

uint16_t count_leading_zeros(uint64_t x, int input_size)
{
    x |= (x >> 1);
    x |= (x >> 2);
    x |= (x >> 4);
    x |= (x >> 8);
    x |= (x >> 16);
    x |= (x >> 32);

    /* count ones (population count) */
    x -= ((x >> 1) & 0x5555555555555555 );
    x = ((x >> 2) & 0x3333333333333333) + (x & 0x3333333333333333 );
    x = ((x >> 4) + x) & 0x0f0f0f0f0f0f0f0f;
    x += (x >> 8);
    x += (x >> 16);
    x += (x >> 32);

    return (input_size - (x & 0x7f));
}

int priority_encoder_4bit(int leading_zero, int input_size) {
    if(leading_zero<0){
        printf("wrong input");
        return 0;
    }
    else{
        return (input_size -1 -(leading_zero));
    }
}

int main() {
    int input;
    int input_size;
    int leading_zero = 0;
    int priority = 0;
    printf("Enter a  number : ");
    scanf("%4d", &input);
    printf("Enter a  input size : ");
    scanf("%4d", &input_size);
    leading_zero = count_leading_zeros(input,input_size);
    printf("%d\n",leading_zero);
    priority = priority_encoder_4bit(leading_zero, input_size);
    printf("%d\n",priority);
    return 0;
}

RISC-V code































































































.data:
    nextline: .string "\n"
    input_size: .word 4
    data_1: .word 0b00000000000000000000000000000001
    data_2: .word 0b00000000000000000000000000000011
    data_3: .word 0b00000000000000000000000000000100
    data_4: .word 0b00000000000000000000000000001001
    
main:
    addi sp, sp, -20
    
    # push four pointers of test data onto the stack
    lw t0, data_1
    sw t0, 0(sp)
    lw t0, data_2
    sw t0, 4(sp)
    lw t0, data_3
    sw t0, 8(sp)
    lw t0, data_4
    sw t0, 12(sp)
    
    addi s0, x0, 4    # s0 is the iteration times(4 test case)
    addi s1, x0, 0    # s1 is counter
    addi s2, sp, 0    # s2 initial at data_1

loop:
    lw a0, 0(s2)        #load data into a0
    addi s2, s2, 4      # s2 move to next data
    addi s1, s1, 1      # counter++
clz:
    # Load the constants A01, A02, and mask 0x0f0f0f0f
    lui a5, 0x55555
    ori a5, a5, 0x555
    lui a6, 0x33333
    ori a6, a6, 0x333
    lui a7, 0x0f0f0
    ori a7, a7, 0x788
    addi a7, a7, 0x787

    # OR the input value with itself shifted to the right by 1, 2, 4, 8, 16, and 16 bits (total 32).
    srli a1, a0, 1   
    or a0, a0, a1 
    srli a1, a0, 2   
    or a0, a0, a1 
    srli a1, a0, 4   
    or a0, a0, a1 
    srli a1, a0, 8   
    or a0, a0, a1 
    srli a1, a0, 16  
    or a0, a0, a1 
    srli a1, a0, 16  
    srli a2, a1, 16  
    or a0, a0, a2 
  
    # Count ones (population count).
    srli a1, a0, 1
    and a4, a1, a5  
    sub a0, a0, a4
    and a1, a0, a6
    srli a2, a0, 2
    and a3, a6, a2
    add a0, a3, a1
    srli a1, a0, 4
    add a0, a1, a0
    and a0, a0, a7
    srli a1, a0, 8
    add a0,a1,a0
    srli a1, a0, 16
    add a0, a1, a0
    srli a1, a0, 16
    srli a2, a1, 16
    add a0, a0, a2
  
    # Return the number of zeros.
    andi a0, a0, 0x7f # andi a0, a0, 0x7f
    addi a1, x0, input_size
    sub a0, a1, a0
    
    #count the priority
    addi a1, a1, -1
    sub a0, a1, a0
  
print:
    li a7, 35
    ecall
    lw a0, nextline
    li, a7, 11
    ecall
    bne s1, s0, loop
    beq s1, s0, end
    
end:  
    li a7, 10    # Exit system call
    ecall

Analysis

We test our code with Ripes simulator.
Here are the first few lines of the executable code:



































00000016 <main>:
    16:        fec10113        addi x2 x2 -20
    1a:        00000297        auipc x5 0x0 <nextline>
    1e:        fec2a283        lw x5 -20 x5
    22:        00512023        sw x5 0 x2
    26:        00000297        auipc x5 0x0 <nextline>
    2a:        fe42a283        lw x5 -28 x5
    2e:        00512223        sw x5 4 x2
    32:        00000297        auipc x5 0x0 <nextline>
    36:        fdc2a283        lw x5 -36 x5
    3a:        00512423        sw x5 8 x2
    3e:        00000297        auipc x5 0x0 <nextline>
    42:        fd42a283        lw x5 -44 x5
    46:        00512623        sw x5 12 x2
    4a:        00400413        addi x8 x0 4
    4e:        00000493        addi x9 x0 0
    52:        00010913        addi x18 x2 0

00000056 <loop>:
    56:        00092503        lw x10 0 x18
    5a:        00490913        addi x18 x18 4
    5e:        00148493        addi x9 x9 1

00000062 <clz>:
    62:        555557b7        lui x15 0x55555
    66:        5557e793        ori x15 x15 1365
    6a:        33333837        lui x16 0x33333
    6e:        33386813        ori x16 x16 819
    72:        0f0f08b7        lui x17 0xf0f0
    76:        7888e893        ori x17 x17 1928
    7a:        78788893        addi x17 x17 1927
    7e:        00155593        srli x11 x10 1
    82:        00b56533        or x10 x10 x11
    86:        00255593        srli x11 x10 2
    8a:        00b56533        or x10 x10 x11

5-stage pipelined processor

We choose 5-stage processor to run our code.
Below is the processor diagram.

Next I would analyze how the instruction go through

Instruction fetch (IF)
Instruction decode and register fetch (ID)
Execute (EX)
Memory access (MEM)
Register write back (WB)

I would take addi x2, x2, -20 as instance:

IF stage

The address is now at 0x00000016 since the instruction is put at 0x00000016.
PC will increment by 4 since no branch occur, so PC will become 0x0000001a.
We can know the machine of the instruction is 0xfec10113 from instr. memory.

ID stage

The instruction is decoded from0xfec10113 to ADDI.
And we can see the result will be store at 0x02, which is x2.

EX stage

ALU recieve a value 0xffffffec, which is -20 in the instruction.
And 0x02(x2) and 0xffffffec is passed to the next stage to execute.

MEM stage

The data in is 0x00000000 and register 0x02 is passed to the final stage.

WB stage

The register 0x02 is writed and pushed out, then the instruction is done.

Enhancement

Instruction refinement
First, I found that I mistakenly put the initialization of masks in loop.

Then I put it out of loop to avoid repeating initialization.
Before