contributed by <吳睿秉>
Be aware of Markdown headings.
Use color sparingly, only when necessary to differentiate from plain text. Prioritize content over formatting.
The function fabsf is used to obtain the absolute value of a floating-point number. This is achieved by applying the AND operation with the mask 0x7FFFFFFF
, which sets the sign bit to 0.
In this C program, the input floating point number's address is first cast to a uint32_t
type to enable bitwise operations. After this, the uint32_t
value is masked with 0x7FFFFFFF
to clear the sign bit. Finally, the result is cast back to the original floating point type, which gives the absolute value of the floating point number.
Don't paste code snip without comprehensive discussions.
There are three test cases, so the len is set to 3
in advance. and the process will print the input to the console before calling the function fabsf
. After the function fabsf
returns, the output will be stored in a0
and then moved to a1
. Next, the value in a1
will be compared to the expected answer in a2
. The console will print 1
if the output equals the expected answer.
register name | content |
---|---|
t0 | data start address (input) |
t1 | number of data |
t2 | data start address (expect output) |
t3 | hold the value in a0 during a syscall temporarily |
t4 | hold 0x7FFFFFFF temporarily |
a1 | fabsf output |
a2 | expect output |
Input: 0x3f000000 (0.5 in decimal)
Output: 0x3f000000 (0.5 in decimal)
Input: 0xc1420000 (-12.125 in decimal)
Output: 0x41420000 (12.125 in decimal)
Input: 0xf0700000 (-2.97106e+29 in decimal)
Output: 0x70700000 (2.97106e+29 in decimal)
Use fewer instructions.
The my_clz
function counts the leading zeros in a 32-bit digit. Starting from the leftmost digit and moving toward the right, it counts the zeros until a '1' appears. The function then returns the number of zeros in t4
register. Note that if the 32-bit digit is all zeros (i.e., 0), the leading zeros are considered undefined.
This C program simply counts the leading zeros in the input x
. It starts by setting the count
to 0
. Then, it iterates from left to right through all 32 bits. Using a left shift operation (1U << i
), for i=31
to 0
, it performs an AND operation with x
to check if the bit is 1. As soon as the first 1 appears from the left side, the loop breaks and returns count
, which stores the number of leading zeros.
The RISC-V program for counting leading zeros works as follows. There are four inputs. The program displays the 32-bit input before starting my_clz
. It uses a loop to accumulate the number of leading zeros until the t6
register holds a 1, meaning the condition (x & (1 << i))
is true. At this point, the loop stops, and the count is stored in t4
. The program automatically checks the answer, and if it's correct, it will print "correct."
register name | content |
---|---|
t0 | data start address (input) |
t1 | number of data |
t2 | data start address (expect output) |
t3 | hold the value in a0 during a syscall temporarily |
t4 | count |
t5 | i |
t6 | (x & (1 << i)) |
a1 | my_clz output |
a2 | expect output |
s0 | store the immediate value 32 to check if the result is undefined. |
Input: 0x0000002 (2 in decimal,
00000000 00000000 00000000 00000010
in binary)
Output: 30
Input: 0x000000FF (255 in decimal,
00000000 00000000 00000000 11111111
in binary)
Output: 24
Input: 0xFFFFFFFF (-1 in decimal,
11111111 11111111 11111111 11111111
in binary)
Output: 0
Input: 0x00000000 (0)
Output: undefined
The purpose of fp16_to_fp32 is to convert a 16-bit floating-point number to a 32-bit floating-point number. Due to the format differences between FP16 and FP32, certain transformations are required during the conversion process.
There are several steps involved. First, the input h
is shifted left by 16 bits into a 32-bit uint32_t
, where the upper bits hold the FP16 data and the lower bits remain zero.
Second, the sign
bit is extracted, and the nonsign
bits (i.e., exponent and mantissa) are separated.
The third step involves calculating the renormalization shift. The function my_clz
, as mentioned earlier, is used to count the leading zeros in the nonsign part (exponent and mantissa). The result is adjusted by subtracting 5 if it's greater than 5, otherwise it's set to 0. This shift value is used to normalize the mantissa by shifting it left and then right.
The fourth step handles the special cases for the exponent being all ones (infinity or NaN). In FP16, an exponent of 0x1F (31 in decimal) indicates either infinity or NaN. The result is right-shifted by 8 bits to move the exponent to the correct position in the 32-bit representation. The mask 0x7F800000
ensures that only the exponent bits are retained in the final result.
In the fifth step, the zero_mask is used to checks if nonsign
is zero. If so, equals 0, subtracting 1 causes overflow, setting bit 31 to 1. If nonsign is not zero, bit 31 remains 0.
Finally, the function combines the sign, the normalized nonsign, and applies the inf_nan_mask
and zero_mask
to handle cases of infinity
, NaN
, and zero
, ultimately returning the 32-bit floating-point value.
This RISC-V program executes the process described in the C program, utilizing the previously mentioned my_clz
function to convert from FP16 to FP32.
register name | content |
---|---|
s1 | h, w, renorm_shift |
s2 | sign |
s3 | nonsign |
s4 | inf_nan_mask |
s5 | zero_mask |
s6 | (nonsign << renorm_shift >> 3) |
s7 | ((0x70 - renorm_shift) << 23) |
s8 | result |
s11 | Temporarily store an immediate value for operations. |
Input: 0x3C00 (1 in fp16, 15360 in decimal on the console)
Output: 0x3F800000 (1 in fp32, 1065353216 in decimal on the console)
Input: 0xCC10 (-16.25 in fp16, 52240 in decimal on the console)
Output: 0xC1820000 (-16.25 in fp32, -1048444928 in decimal on the console)
Input: 0x962C (-0.001507 in fp16, 38444 in decimal on the console)
Output: 0xBAC58000 (-0.0015068054 in fp32, -1161461760 in decimal on the console)
You are given an array nums
, where each number in the array appears either once or twice.
Return the bitwise XOR
of all the numbers that appear twice in the array, or 0 if no number appears twice.
Input: nums = [1,2,2,1]
Output: 3
Explanation: Numbers 1 and 2 appeared twice.1 XOR 2 == 3
.
1 <= nums.length <= 50
1 <= nums[i] <= 50
nums
appears either once or twice.This problem can easily be solved with a nested loop that checks the values in the array. If a value repeats, perform the XOR operation to get the final answer. This approach is straightforward and easy to understand.
register name | content |
---|---|
s3 | nums[i] address |
s9 | nums[i] value |
s11 | i |
s4 | nums[j] address |
t2 | nums[j] value |
t1 | j |
Execution info | Value |
---|---|
Cycles | 3047 |
Instrs. retired | 1816 |
CPI | 1.68 |
IPC | 0.596 |
This problem can easily be solved with Naive Approach. While it’s easy to understand, it is far from being the most time-efficient solution. After studying Problem C in Quiz 1, I applied a similar concept by using bitwise shift and AND operations to efficiently maintain information.
This problem has the constraint that nums[i] < 50
. Instead of using a naive O(n²)
solution, a 64-bit variable (uint64_t) can be used to record whether each number has appeared. Since 64 bits exceed the range of 50, using this method is efficient and sufficient for this problem.
If a number appears for the first time, set the corresponding bit in the uint64_t using an OR operation (i.e., set |= (1ULL << nums[i])
at line 8). To check if the number has appeared a second time, use an AND operation(i.e., set & (1ULL << nums[i])
at line 6). If it has, then perform the XOR operation.
In the C program, the XOR result is stored in a uint64_t variable. However, in RISC-V, the registers are only 32 bits wide. If we use only one 32-bit register (s7
register in Wrong version of dupNumXOR function)to store the result, it could lead to incorrect answers.
This scenario is similar to the C code, where changing uint64_t set = 0ULL;
to uint32_t set = 0ULL;
in line 3 would also lead to an incorrect result.
In this case, if nums[i] > 31
, the information from (1ULL << nums[i])
cannot be stored in a single 32-bit register.
To correctly store a 64-bit unsigned integer in RISC-V, we need to utilize two 32-bit registers(in RISC-V Corrected version, register s7
and s8
).
Execution info | Value |
---|---|
Cycles | 754 |
Instrs. retired | 528 |
CPI | 1.43 |
IPC | 0.7 |
Eventually, after slightly rearranging the registers in the checks for upper and lower bits and improving the branch conditions, I revised the code, reducing the number of instructions and cycles.
register name | content |
---|---|
t4 | immediate 32 for comparision, (nums[i] - 31) |
t5 | const 1 |
t6 | hold the value in a0 during a syscall temporarily |
a1 | array length (numsSize) |
s1 | number of dataset |
s2 | length of every dataset base address |
s3 | dataset base address |
s4 | expect base address |
s5 | currently doing x-th in a dataset |
s6 | xor result |
s7 | set_A (the left half bits) |
s8 | set_B (the right half bits) |
s9 | nums[i] |
s10 | (set & (1ULL << nums[i])) |
s11 | i |
Execution info | Value |
---|---|
Cycles | 683 |
Instrs. retired | 463 |
CPI | 1.48 |
IPC | 0.678 |
Input: nums = [1, 2, 1, 3]
Output: 1
Input: nums = [1, 2, 3, 4, 6, 7, 8, 2, 9, 9, 3, 5, 1, 5]
Output: 12
Input: nums = [3, 49, 49, 3, 8, 12, 19, 19, 28, 27, 28, 50, 31, 36, 50, 36]
Output: 43
In the Ripes simulator, the CPU is organized as a classic five-stage pipeline, with pipeline registers between each stage. An instruction will sequentially go through these stages, allowing for different instructions to overlap in execution. For instance, consider the instruction lw a1, 0(s2)
, which retrieves the starting address of the size
array and stores it in the a1 register. This example illustrates how the pipeline operates in the following five stages:
In RISC-V, the register a1
is represented as x11
, and the register s2
is represented as x18
. Therefore, the instruction becomes lw x11, 0 x18
, as shown below, which means loading a word from the address specified by s2
(in x18
) into a1
(in x11
). The lw
instruction format is as follows:
lw rd, offset(rs1)
31 ~ 20 | 19 ~ 15 | 14 ~ 12 | 11 ~ 7 | 6 ~ 0 |
---|---|---|---|---|
imm [11:0] | rs1 | funct3 | rd | opcode |
0 | x18 | x11 | ||
000000000000 | 10010 | 010 | 01011 | 0000011 |
According to the Program Counter (PC), the instruction address in instruction memory is 0x00000018
, which retrieves the instruction 00092583
. The PC will then load the address of the next instruction, which is the current instruction address plus 4 bytes (i.e., 0x0000001C).
The instruction is decoded from binary according to the above table, and control signals are generated to determine what the instruction is supposed to do.
Additionally, this stage reads the values from the source registers in the register file. In this case, rs1
, which corresponds to s2
, contains the starting address of the size
array in the data section. The starting address is exactly 0x10000004
, and the offset immediate is 0
.
In memory, the orange area indicates the address and the words of the size
array that are stored. The starting address is 0x10000004
, with space for 3 words. The corresponding data is 4, 14, 16
, as previously defined.
Arithmetic or logical operations are performed in this stage. The address calculation uses the base address from the source register s2
and the offset (which is 0
). The multiplexer selects the register value from the previous step (i.e., the ID stage). The ALU adds these two values together, resulting in an effective address that is the same as the value stored in s2
.
For memory-related instructions, this stage handles reading from or writing to memory. In this stage, it retrieves the word stored at the address specified by the offset in data memory. As a result, the first word in the size
array, which is 4
, will be retrieved.
The word from data memory is written back to the register file. The result 4
, is stored in the destination register a1
(0b
).
Always refer to primary sources, such as official RISC-V documentation.