Try   HackMD

Assignment1: RISC-V Assembly and Instruction Pipeline

contributed by <liballouo>

Problem C in Quiz 1

We test our code using Ripes simulator.

Version 1

C code

static inline int my_clz(uint32_t x) {
    int count = 0;
    for (int i = 31; i >= 0; --i) {
        if (x & (1U << i))
            break;
        count++;
    }
    return count;
}

RISC-V assembly code

You can see the entire code here.

# Input:  uint32_t a0
# Output: uint32_t a0
my_clz:
    li   t0, 0          # count = 0
    li   t1, 31         # t1 = 31
clz_loop:
    li   t2, 1 
    sll  t2, t2, t1    # (1U << i)
    and  t3, a0, t2    # x & (1U << i)
    bnez t3, clz_done # if (x & (1U << i)) break loop
    addi t0, t0, 1    # count = count + 1
    addi t1, t1, -1   # i = i - 1
    bgez t1, clz_loop # if (i>=0) continue loop
clz_done:
    addi a0, t0, 0    # set result to a0
    ret

execution info

Info Value
Cycle 443
Instrs. retired 322
CPI 1.38
IPC 0.727
Clock rate 0 Hz

Version 2

Counting leading zero utilizes population count.

C code

static inline int my_clz(uint32_t x)
{
    x |= (x >> 1);
    x |= (x >> 2);
    x |= (x >> 4);
    x |= (x >> 8);
    x |= (x >> 16);

    /* count ones (population count) */
    x -= ((x >> 1) & 0x55555555);
    x =  (x >> 2) & 0x33333333) + (x & 0x33333333);
    x =  ((x >> 4) + x) & 0x0f0f0f0f;
    x += (x >> 8);
    x += (x >> 16);

    return (32 - (x & 0x3f));
}

RISC-V assembly code

You can see the entire code here.

# Input:  uint32_t a0
# Output: uint32_t a0
my_clz:
    srli t0, a0, 1   # x >> 1
    or   a0, a0, t0  # x |= (x >> 1)
    srli t0, a0, 2   # x >> 2
    or   a0, a0, t0  # x |= (x >> 2)
    srli t0, a0, 4   # x >> 4
    or   a0, a0, t0  # x |= (x >> 4)
    srli t0, a0, 8   # x >> 8
    or   a0, a0, t0  # x |= (x >> 8)
    srli t0, a0, 16  # x >> 16
    or   a0, a0, t0  # x |= (x >> 16)
 
popcount:
    lui  t1, 0x55555
    ori  t1, t1, 0x555 # 0x55555555 (t1)
    srli t0, a0, 1   # x >> 1
    and  t2, t0, t1  # (x >> 1) & 0x55555555
    sub  a0, a0, t2  # x -= ((x >> 1) & 0x55555555)
    
    lui  t1, 0x33333
    ori  t1, t1, 0x333 # 0x33333333 (t1)
    srli t0, a0, 2   # x >> 2
    and  t2, t0, t1  # (x >> 2) & 0x33333333
    and  t3, a0, t1  # x & 0x33333333
    add  a0, t2, t3  # ((x >> 2) & 0x33333333) + (x & 0x33333333)
    
    lui  t1, 0x0f0f0
    ori  t1, t1, 0x70f
    addi t1, t1, 0x800 # 0x0f0f0f0f (t1)
    srli t0, a0, 4   # x >> 4
    add  t2, a0, t0  # (x >> 4) + x
    and  a0, t2, t1  # ((x >> 4) + x) & 0x0f0f0f0f
    
    srli t0, a0, 8   # x >> 8
    add  a0, a0, t0  # x += (x >> 8)
    
    srli t0, a0, 16  # x >> 16
    add  a0, a0, t0  # x += (x >> 16)
        
    andi t0, a0, 0x3f # x & 0x3f
    li   t1, 32
    sub  a0, t1, t0   # 32 - (x & 0x3f)
    
    ret

execution info

Info Value
Cycle 194
Instrs. retired 151
CPI 1.28
IPC 0.778
Clock rate 0 Hz

test cases

test case 1

Input: 0x3
Output: 30

test case 2

Input: 0xFFFFFFFF
Output: 0

test case 3

Input: 0x3321675
Output: 6

Biweekly Contest 141: Q1. Construct the Minimum Bitwise Array II

Q1. Construct the Minimum Bitwise Array II

Description

You are given an array nums consisting of n prime integers.
You need to construct an array ans of length n, such that, for each index i, the bitwise OR of ans[i] and ans[i] + 1 is equal to nums[i], i.e. ans[i] OR (ans[i] + 1) == nums[i].
Additionally, you must minimize each value of ans[i] in the resulting array.
If it is not possible to find such a value for ans[i] that satisfies the condition, then set ans[i] = -1.
A prime number is a natural number greater than 1 with only two factors, 1 and itself.

Constraints

  • 1 <= nums.length <= 100
  • 2 <= nums[i] <= 10^9
  • nums[i] is a prime number.

Solution

There are three cases to consider:

  1. Case1: If nums[i] = 2, then ans[i] = -1.
  2. Case 2: If nums[i] consists of all 1s in binary representation, then ans[i] = nums[i] >> 1.
  3. Case 3: For other numbers, find the rightmost 0 bit and change the 1 bit immediately to its right to 0.

Here's an example for Case 3:

input num = 19 (in decimal, 10011 in binary)
expect output = 17

1. find the rightmost 0 : 10011
                            ^  
   index i = 2
2. set i-1 to 0 : 10011 -> 10001
                     ^        ^
   answer = 10001 (binary) = 17 (decimal)
3. check the answer : 
   10001 ^ 10010 = 10011 = 19 (decimal) -> checked!

We implement it in C.
You can see the entire code here.
To handle different cases, the solution involves multiple functions:

minBitwiseNum Function

This function determines the appropriate value for ans based on the input num.

uint32_t minBitwiseNum(uint32_t num) {
    // Input:  num
    // Output: ans
    uint32_t ans;
    
    // case: 2
    if(num == 2){
        ans = -1;
    }
    // case: all 1s (2^n-1)
    else if(ilog2(num) + 1 == popcount(num)){
        ans = num >> 1;
    }
    // case: others
    else{
        ans = helper(num);
    }
    
    return ans;
}
  • Case 1: This handles the special case where num equals 2, for which there is no valid solution for ans, so the function returns -1.
  • Case 2: This checks if num has all 1s in its binary representation (e.g., 3, 7, 15). The condition uses the ilog2 function to get the number of bits required and popcount to check if all bits are set.
  • Case 3: When the number doesn't fall under the previous two cases, the helper function finds the correct value.

ilog2 Function

This function calculates the integer logarithm base 2 of a number using the my_clz function to count leading zeros.

int ilog2(uint32_t num){
    return 31 - my_clz(num);
}

my_clz Function

The my_clz function counts the number of leading zeros in a 32-bit unsigned integer. It uses bitwise operations to set all bits to 1 to the right of the highest set bit and then uses popcount to calculate the number of 1s. Last, it returns the value of leading zero.

int my_clz(uint32_t x)
{
    x |= (x >> 1);
    x |= (x >> 2);
    x |= (x >> 4);
    x |= (x >> 8);
    x |= (x >> 16);

    x = popcount(x);

    return (32 - (x & 0x3f));
}

popcount Function

The popcount function counts the number of 1s in the binary representation of a number. It uses bit manipulation techniques to sum the bits in groups, eventually reducing to a count of all 1s.

int popcount(uint32_t x){
    /* count ones (population count) */
    x -= ((x >> 1) & 0x55555555);
    x =  ((x >> 2) & 0x33333333) + (x & 0x33333333);
    x =  ((x >> 4) + x) & 0x0f0f0f0f;
    x += (x >> 8);
    x += (x >> 16);
    
    return x;
}

helper Function

The helper function is used to find the minimum possible value for ans when the number does not fit into the special cases. It looks for the rightmost 0 bit and clears the bit immediately to its left.

uint32_t helper(uint32_t num) {
    for(int i=1; i<32; i++){
        // Find the rightmost 0
        if(!((1 << i) & num)){
            // Change the 1 on the right of the rightmost 0 to 0
            num ^= (1 << (i - 1));
            break;
        }
    }
    return num;
}

After finishing the C code, we translate it from C code to a RISC-V assembly program.
You can see the entire code here.

RISC-V assembly code

# Input:  uint32_t a0
# Output: uint32_t a0
# t6: flag, 0 for ilog ; 1 for popcount

min_bitwise_num:
    addi sp, sp, -4
    sw   ra, 0(sp)
    li   t0, 2
    beq  a0, t0, case_2      # case: num == 2
    
    addi a1, a0, 0
    li   t6, 0               # flag for ilog / popcount
    jal  ra, ilog2           # do ilog2(num) and save the result in a2
    addi a0, a1, 0
    addi a2, a2, 1           # ilog2(num) + 1
    li   t6, 1               # flag for ilog / popcount
    jal  ra, popcount        # do popcount(num) and save the result in a3
    addi a0, a1, 0
    beq  a2, a3, case_all_1s # case: all 1s
    
    jal ra, helper           # case: others

case_2:
    li   a0, -1              # no possible value
    lw   ra, 0(sp)
    addi sp, sp, 4
    ret                      

my_clz:
    srli t0, a0, 1   # x >> 1
    or   a0, a0, t0  # x |= (x >> 1)
    srli t0, a0, 2   # x >> 2
    or   a0, a0, t0  # x |= (x >> 2)
    srli t0, a0, 4   # x >> 4
    or   a0, a0, t0  # x |= (x >> 4)
    srli t0, a0, 8   # x >> 8
    or   a0, a0, t0  # x |= (x >> 8)
    srli t0, a0, 16  # x >> 16
    or   a0, a0, t0  # x |= (x >> 16)
 
popcount:
    lui  t1, 0x55555
    ori  t1, t1, 0x555 # 0x55555555 (t1)
    srli t0, a0, 1   # x >> 1
    and  t2, t0, t1  # (x >> 1) & 0x55555555
    sub  a0, a0, t2  # x -= ((x >> 1) & 0x55555555)
    
    lui  t1, 0x33333
    ori  t1, t1, 0x333 # 0x33333333 (t1)
    srli t0, a0, 2   # x >> 2
    and  t2, t0, t1  # (x >> 2) & 0x33333333
    and  t3, a0, t1  # x & 0x33333333
    add  a0, t2, t3  # ((x >> 2) & 0x33333333) + (x & 0x33333333)
    
    lui  t1, 0x0f0f0
    ori  t1, t1, 0x70f
    addi t1, t1, 0x800 # 0x0f0f0f0f (t1)
    srli t0, a0, 4   # x >> 4
    add  t2, a0, t0  # (x >> 4) + x
    and  a0, t2, t1  # ((x >> 4) + x) & 0x0f0f0f0f
    
    srli t0, a0, 8   # x >> 8
    add  a0, a0, t0  # x += (x >> 8)
    
    srli t0, a0, 16  # x >> 16
    add  a0, a0, t0  # x += (x >> 16)
    beqz t6, clz_done 
    addi a3, a0, 0   # save the result in a3
    ret
    
clz_done:
    andi t0, a0, 0x3f # x & 0x3f
    li   t1, 32
    sub  a0, t1, t0  # 32 - (x & 0x3f)
    
    ret

ilog2:
    addi sp, sp, -4
    sw   ra, 0(sp)
    jal  ra  my_clz  # my_clz(num) and save the result in a0
    li   t2, 31
    sub  a2, t2, a0  # 31 - my_clz(num)
    lw   ra, 0(sp)
    addi sp, sp, 4
    ret

case_all_1s:
    srli a0, a0, 1   # ans = num >> 1
    lw   ra, 0(sp)
    addi sp, sp, 4
    ret

helper:
    li   t0, 32      # upper bound t0 = 32
    li   t1, 1       # loop start from 1, i(t1)
        
helper_loop:
    li   t2, 1       # t2 = 1
    sll  t2, t2, t1  # 1 << i 
    and  t3, a0, t2  # (1 << i) & num
    beqz t3, helper_done # correctly find the rightmost 0
    addi t1, t1, 1   # i += 1
    bne  t0, t1, helper_loop # do not find the rightmost 0 yet, continue the loop

helper_done:
    srli t2, t2, 1   # 1 << (i - 1)
    xor  a0, a0, t2  # num ^= (1 << (i - 1))
    lw   ra, 0(sp)
    addi sp, sp, 4
        
    ret 

Result

test cases

test case 1

Input: 2
Output: -1

test case 2

Input: 3
Output: 1

test case 3

Input: 5
Output: 4

test case 4

Input: 7
Output: 3

test case 5

Input: 31
Output: 15

test case 6

Input: 307
Output: 305

test case 7

Input: 383
Output: 319

test case 8

Input: 5039
Output: 5031

execution info

C code
Info Value
Cycle 6522
Instrs. retired 5066
CPI 1.26
IPC 0.777
Clock rate 0 Hz
RISC-V assembly code
Info Value
Cycle 1056
Instrs. retired 806
CPI 1.31
IPC 0.763
Clock rate 0 Hz

Analysis

pseudo instruction

The pseudo instruction below is popcount part of the code run on Ripes.

000000cc <popcount>:
    cc:        55555337        lui x6 0x55555
    d0:        55536313        ori x6 x6 1365
    d4:        00155293        srli x5 x10 1
    d8:        0062f3b3        and x7 x5 x6
    dc:        40750533        sub x10 x10 x7
    e0:        33333337        lui x6 0x33333
    e4:        33336313        ori x6 x6 819
    e8:        00255293        srli x5 x10 2
    ec:        0062f3b3        and x7 x5 x6
    f0:        00657e33        and x28 x10 x6
    f4:        01c38533        add x10 x7 x28
    f8:        0f0f0337        lui x6 0xf0f0
    fc:        70f36313        ori x6 x6 1807
    100:        80030313        addi x6 x6 -2048
    104:        00455293        srli x5 x10 4
    108:        005503b3        add x7 x10 x5
    10c:        0063f533        and x10 x7 x6
    110:        00855293        srli x5 x10 8
    114:        00550533        add x10 x10 x5
    118:        01055293        srli x5 x10 16
    11c:        00550533        add x10 x10 x5
    120:        000f8663        beq x31 x0 12 <clz_done>
    124:        00050693        addi x13 x10 0
    128:        00008067        jalr x0 x1 0

5-stage pipelined processor

We choose 5-stage pipelined processor as the processor we use in Ripes to run the code. Its block diagram look like this:

{19B27C82-BEEA-48B8-971F-975848B1E2B8}
The "5-stage" means this processor using five-stage pipeline to parallelize instructions. The stages are:

  1. IF: Instruction fetch
  2. ID: Instruction decode and register fetch
  3. EX: Execute
  4. MEM: Memory access
  5. WB: Register write back

To illustrate, we take the instruction lui t1, 0x55555 as an example. According to The RISC-V Instruction Set Manual

LUI (load upper immediate) is used to build 32-bit constants and uses the U-type format. LUI places the U-immediate value in the top 20 bits of the destination register rd, filling in the lowest 12 bits with zeros.

U-type format instruction

The layout of the U-format instructions is:

imm [31:12] rd [11:7] opcode [6:0]

The pseudo instruction is lui x6 0x55555 which decoded into 55555337.

imm [31:12] rd [11:7] opcode [6:0] representation
01010101010101010101 00110 0110111 binary
0x55555 0x6 0x37 heximal

1. IF: Instruction fetch

{DDF6555A-3CA9-4180-B899-0E9DD8A8F7E7}

  • The PC starts at 0x000000cc. PC+4 will pass to the following stages for further usages.
  • We fetch the instruction at 0x000000cc from instruction memory, resulting in the instruction 0x55555337.

2. ID: Instruction decode and register fetch

{470862F3-ECD1-4D9C-971B-35B808B3FD54}

  • The instruction 0x55555337 is decoded, revealing lui x6, 0x55555.
    • opcode = LUI
    • Imm. = 0x55555000
    • Wr idx = 0x06
  • Although U-type instructions do not require source registers, the processor still fetches the source register indices from the instruction, resulting in rs1 = 0x0a and rs2 = 0x15. These fields are irrelevant for this instruction.
    ​​​​31                25 24          20 19          15 14    12 ...
    ​​​​0  1  0  1  0  1  0  1  0  1  0  1  0  1  0  1  0  1  0  1  ...
    ​​​​                     | rs2 = 0x15|  | rs1 = 0x0a|
    ​​​​                     -------------  -------------
    
  • The values of registers rs1 and rs2 are fetched, resulting in Reg 1 = 0x00000003 and Reg 2 = 0x00000000, but they are not used in this instruction.
  • The PC and PC+4 are passed unchanged to the next stage.

EX: Execute

{178DB876-7955-48AB-B299-1BF518A088BF}

  • We do not use the PC+4, so we just let it pass.
  • We do not use Reg 1 and Reg 2 in U-type format instructions, so after their value passes the multiplexers, they are filtered. The result of ALU is just 0x55555000 from Imm..
  • There is no branch taken in this stage.
  • rd is not used in this stage, so it just passes to the next stage.

MEM: Memory access

{810F0247-C207-4ADC-BFF9-706850186FE2}

  • The PC+4 is still passed to the next stage.
  • Memory access is not required for U-type instructions, so there is no memory read or write. Consequently, the "read out" of data memory will be 0x00000000, indicating that no memory access occurred.

WB: Register write back

{B3B0AC7B-504A-4F8B-86DD-C02378FE7460}

  • We pass the rd = 0x06 and imm = 0x55555000 to Wr idx and Wr data respectively.

{C9F14B02-8287-4F7E-81B9-143DCAEB338C}
{F4D25E68-C6DB-4E17-A640-B9088B44F757}

After completing these five stages, the register x6(t1) is set to 0x55555000.

{0AD3DF49-0B60-49E9-9EC8-6DABFA12C7EC}

Reference

The RISC-V Instruction Set Manual
Quiz1 of Computer Architecture (2023 Fall)
Quiz1 of Computer Architecture (2024 Fall)