# Assignment1: RISC-V Assembly and Instruction Pipeline
contributed by < [keywind127](https://github.com/keywind127/) >
###### tags: `RISC-V` `computer architecture 2024`
---
## 1. Problem Statement
In this assignment, we are tasked with translating one of the C programs from [Quiz 1](https://hackmd.io/@sysprog/arch2024-quiz1-sol) (either A, B, or C) into [RISC-V](https://en.wikipedia.org/wiki/RISC-V) assembly language. I have chosen to work on ==Problem C==, as it offers a broad range of potential applications, particularly in the area of bitwise operations such as ["counting leading zeros" (CLZ)](https://en.wikipedia.org/wiki/Leading_zero). This function may prove useful for addressing similar problems in future bit-level programming tasks.
### 1.1 `fabsf()` - Absolute Function for Floating Numbers
Problem C requires the implementation of three C programs, starting with the ==absolute value function for floating-point numbers==, `fabsf()`. According to [IEEE 754](https://en.wikipedia.org/wiki/IEEE_754) standards, the bit structure of floating-point numbers is organized in a specific format, as illustrated below (**image source**: [GeeksForGeeks](https://www.geeksforgeeks.org/ieee-standard-754-floating-point-numbers/)).

The key distinction between positive and negative numbers lies in the [Most Significant Bit (MSB)](https://en.wikipedia.org/wiki/Bit_numbering), which represents the sign: a 0 for positive and a 1 for negative. The simplest way to convert any number to its positive counterpart is to set the [MSB](https://en.wikipedia.org/wiki/Bit_numbering) to 0, regardless of the values of the other bits. This concept aligns with the C code provided below. The details of implementing this logic in [RISC-V](https://en.wikipedia.org/wiki/RISC-V) assembly will be examined later in the report.
```c=
static inline float fabsf(float x) {
uint32_t i = *(uint32_t *)&x;
i &= 0x7FFFFFFF;
x = *(float *)&i;
return x;
}
```
### 1.2 `my_clz()` - Counting the Number of Leading Zeros
The `my_clz()` function is designed specifically to ==count the number of leading zeros== in a given number. This functionality is so frequently utilized that it is included as a built-in function in both [GCC](https://en.wikipedia.org/wiki/GNU_Compiler_Collection) and [G++](https://en.wikipedia.org/wiki/GNU_Compiler_Collection).
The following C program provides a straightforward (*but naive*) implementation of [clz](https://en.wikipedia.org/wiki/Leading_zero). It iterates from the [Most Significant Bit (MSB)](https://en.wikipedia.org/wiki/Bit_numbering) to the [Least Significant Bit (LSB)](https://en.wikipedia.org/wiki/Bit_numbering), incrementing the count and halting as soon as a 1 bit is encountered.
```c=
static inline int my_clz(uint32_t x) {
int count = 0;
for (int i = 31; i >= 0; --i) {
if (x & (1U << i))
break;
count++;
}
return count;
}
```
### 1.3 `fp16_to_fp32()` - Converting Half to Single-Precision Floating Point Numbers
In many fields, particularly [deep learning](https://en.wikipedia.org/wiki/Deep_learning), where [artificial neural networks](https://en.wikipedia.org/wiki/Neural_network_(machine_learning)) are employed, model size often becomes a [bottleneck](https://en.wikipedia.org/wiki/Bottleneck_(software)), leading to slower [training and inference](https://blogs.nvidia.com/blog/difference-deep-learning-training-inference-ai/) due to limited [V-RAM](https://en.wikipedia.org/wiki/Video_random-access_memory). One way to alleviate this issue is by using [half-precision floating-point numbers (fp16)](https://en.wikipedia.org/wiki/Half-precision_floating-point_format) instead of [single-precision (fp32)](https://en.wikipedia.org/wiki/Single-precision_floating-point_format). Although half-precision sacrifices some accuracy, the potential speedup may justify the tradeoff.
The purpose of this function is to convert a floating-point number from [half-precision (fp16)](https://en.wikipedia.org/wiki/Half-precision_floating-point_format) back to [single-precision (fp32)](https://en.wikipedia.org/wiki/Single-precision_floating-point_format). According to the [IEEE 754](https://en.wikipedia.org/wiki/IEEE_754) standard (**image source**: [MindSpore](https://www.mindspore.cn/tutorials/zh-CN/br_base/beginner/mixed_precision.html)), [fp16](https://en.wikipedia.org/wiki/Half-precision_floating-point_format) uses 5 exponent bits, compared to the 8 bits used in [fp32](https://en.wikipedia.org/wiki/Single-precision_floating-point_format), and has only 10 [mantissa bits](https://www.geeksforgeeks.org/introduction-of-floating-point-representation/), whereas [fp32](https://en.wikipedia.org/wiki/Single-precision_floating-point_format) uses 23 bits.

The following C program implements the `fp16_to_fp32()` function. We will later discuss its translation into [RISC-V asembly](https://en.wikipedia.org/wiki/RISC-V) and examine potential [optimizations](https://en.wikipedia.org/wiki/Program_optimization) that can be leveraged.
```c=
static inline uint32_t fp16_to_fp32(uint16_t h) {
const uint32_t w = (uint32_t) h << 16;
const uint32_t sign = w & UINT32_C(0x80000000);
const uint32_t nonsign = w & UINT32_C(0x7FFFFFFF);
uint32_t renorm_shift = my_clz(nonsign);
renorm_shift = renorm_shift > 5 ? renorm_shift - 5 : 0;
const int32_t inf_nan_mask = ((int32_t)(nonsign + 0x04000000) >> 8) & INT32_C(0x7F800000);
const int32_t zero_mask = (int32_t)(nonsign - 1) >> 31;
return sign | ((((nonsign << renorm_shift >> 3) + ((0x70 - renorm_shift) << 23)) | inf_nan_mask) & ~zero_mask);
}
```
---
## 2. RISC-V Implementations and Optimizations
### 2.1 `fabsf()` - Absolute Function for Floating Numbers
According to the [guidelines](https://hackmd.io/@sysprog/2024-arch-homework1) set by the professor, we are prohibited from using [RISC-V instructions with F extensions](https://en.wikipedia.org/wiki/RISC-V). Consequently, the C program must be converted into assembly code under the assumption that we are manipulating the binary data directly.
```c=
static inline uint32_t fabsf(uint32_t x) {
return x & 0x7FFFFFFF;
}
```
Assuming that the variable `x` is stored in register `a0`, we apply a 32-bit [mask](https://en.wikipedia.org/wiki/Mask_(computing)) to clear the [sign bit](https://en.wikipedia.org/wiki/Sign_bit) by forcing it to zero. The following [RISC-V](https://en.wikipedia.org/wiki/RISC-V) assembly code represents a direct translation of the C program into assembly.
```asm=
fabsf:
andi a0, a0, 0x7FFFFFFF # error
jr ra
```
However, this program cannot be assembled as-is because the `andi` instruction follows the [I-format](https://itnext.io/risc-v-instruction-set-cheatsheet-70961b4bbe8), and the constant `0x7FFFFFFF` is a 32-bit value that exceeds the 12-bit immediate field allowed in the I-format (as shown in the diagram below, **source**: [StackOverflow](https://stackoverflow.com/questions/39427092/risc-v-immediate-encoding-variants)).

Fortunately, [RISC-V](https://en.wikipedia.org/wiki/RISC-V) provides the `li` [pseudo-instruction](https://homepage.divms.uiowa.edu/~ghosh/2-2-10.pdf), which can handle 32-bit constants by partitioning them into two parts: a 20-bit upper portion and a 12-bit lower portion. Behind the scenes, it uses the `lui` instruction to load the upper 20 bits and `addi` to incorporate the lower 12 bits into the result.
```asm=
fabsf:
li t0, 0x7FFFFFFF # 0b0111_1111_1111_1111_1111_1111_1111_1111
and a0, a0, t0 # ^
jr ra
```
Which is equivalent to:
```asm=
fabsf:
lui t0, 0x7FFFF # 0b0111_1111_1111_1111_1111
addi t0, t0, 0xFFF # 0b1111_1111_1111
and a0, a0, t0
jr ra
```
To test this function, we implement a simple `main` function, as shown below.
```asm=
main:
li a0, 0xFFFFFFFF
jal ra, fabsf
li a7, 10
ecall
fabsf:
li t0, 0x7FFFFFFF
and a0, a0, t0
jr ra
```
The table below provides a [performance benchmark](https://en.wikipedia.org/wiki/Benchmark_(computing)) for this function. For each 32-bit [floating-point number](https://en.wikipedia.org/wiki/Floating-point_arithmetic), represented in [hexadecimal literal](https://en.wikipedia.org/wiki/Hexadecimal) format, the program executes 8 instructions, requiring 18 [clock cycles](https://en.wikipedia.org/wiki/Clock_rate) to complete.
<center>
|Clock Cycles #|Instructions #|CPI|IPC|Used GPR #|
|:-:|:-:|:-:|:-:|:-:|
|18|8|2.25|0.444|3|
</center>center>

### 2.2 `my_clz()` - Counting the Number of Leading Zeros
In this section, we will examine the implementation and optimization of the `my_clz` function, which stands for "[count leading zeros]((https://en.wikipedia.org/wiki/Leading_zero))." To illustrate the optimization process, we will begin with a direct translation of the C code into [RISC-V](https://en.wikipedia.org/wiki/RISC-V) assembly, as shown below.
```asm=
my_clz:
my_clz_prologue:
li t0, 0 # t0 = count = 0
li t1, 31 # t1 = i = 31
my_clz_loop:
blt t1, x0, my_clz_epilogue # exit loop if i < 0
li t2, 1 # t2 = 1
sll t2, t2, t1 # t2 = 1 << i
and t2, t2, a0 # t2 = x & (1 << i)
bne t2, x0, my_clz_epilogue # exit loop if t2 is true
addi t0, t0, 1 # t0 = t0 + 1 = count + 1
addi, t1, t1, -1 # t1 = t1 - 1 = i - 1
j my_clz_loop
my_clz_epilogue:
mv a0, t0 # a0 = count
jr ra # return control
```
We will also include a simple `main` function to benchmark the performance of this direct translation approach.
```asm=
main:
li a0, 0x0
jal ra, my_clz
li a7, 10
ecall
```
Since this algorithm uses [loops](https://skaminsky115.github.io/teaching/cs61c/resources/fa18/Disc04_notes.pdf) that terminate based on the input data, we selected three test values: `0x0`, `0xFFFF`, and `0xFFFFFFFF`. The results are presented in the table below.
<center>
|Test Data|Clock Cycles #|Instructions #|CPI|IPC|Used GPR #|
|:-:|:-:|:-:|:-:|:-:|:-:|
|0x0|==341==|265|1.29|0.777|5|
|0xFFFF|186|142|1.31|0.763|5|
|0xFFFFFFFF|25|13|1.92|0.52|5|
</center>
Upon reviewing the assembly code above, we notice ==redundant computations== when checking if a bit is 0 or 1. Specifically, as we iterate from the 31st bit to the 0th bit, the [bitwise shifts](https://en.wikipedia.org/wiki/Bitwise_operation) are recalculated repeatedly, leading to multiple unnecessary `li` instructions. To optimize this, we can ==utilize a single mask==, `0x80000000`, and shift it logically to the right within the loop. The optimized code is shown below.
```asm=
main:
li a0, 0x0
jal ra, my_clz
li a7, 10
ecall
my_clz:
my_clz_prologue:
li t0, 0 # t0 = count = 0
li t1, 0x80000000 # t1 = mask = 0b1000_0000_0000_0000_0000_0000_0000_0000
my_clz_loop:
beq t1, x0, my_clz_epilogue
and t2, t1, a0
bne t2, x0, my_clz_epilogue
addi t0, t0, 1
srli t1, t1, 1
j my_clz_loop
my_clz_epilogue:
mv a0, t0 # a0 = count
jr ra # return control
```
The table below shows the [benchmarking](https://en.wikipedia.org/wiki/Benchmark_(computing)) results of the optimized code. Notably, for the [worst-case](https://en.wikipedia.org/wiki/Worst-case_complexity) scenario, the number of [clock cycles](https://en.wikipedia.org/wiki/Clock_rate) decreased from 341 to 277, resulting in a ==19% speedup==. However, the number of [general-purpose registers (GPR)](https://www.geeksforgeeks.org/general-purpose-registers/) used remain the same.
<center>
|Test Data|Clock Cycles #|Instructions #|CPI|IPC|Used GPR #|
|:-:|:-:|:-:|:-:|:-:|:-:|
|0x0|==277==|201|1.38|0.726|5|
|0xFFFF|152|108|1.41|0.711|5|
|0xFFFFFFFF|23|11|2.09|0.478|5|
</center>
Given that `clz` is a widely-used function, it is beneficial to explore further [optimizations](https://en.wikipedia.org/wiki/Program_optimization) through available online resources. I encountered a loopless and [branchless](https://en.algorithmica.org/hpc/pipelining/branchless/) approach that employs two [efficient algorithms](https://en.wikipedia.org/wiki/Algorithmic_efficiency) to enhance the performance of `clz`.
The first algorithm pads all bits to the right of the first non-zero [MSB](https://en.wikipedia.org/wiki/Bit_numbering) by repeatedly mirroring the bits from left to right and applying [bitwise OR operations](https://en.wikipedia.org/wiki/Bitwise_operation). The second algorithm is the well-known `pop_count` ([hamming weight](https://en.wikipedia.org/wiki/Hamming_weight)), which encodes 16 pairs of adjacent bits into their corresponding [binary representation](https://en.wikipedia.org/wiki/Binary_number) of the number of 1 bits. These counts are then accumulated to determine the total number of 1 bits in the 32-bit [word](https://en.wikipedia.org/wiki/Word_(computer_architecture)).
<center>
|Bit-1-Before|Bit-0-Before|Bit-1-After|Bit-0-After|
|:-:|:-:|:-:|:-:|
|0|0|0|0|
|0|1|0|1|
|1|0|0|1|
|1|1|1|0|
</center>
According to the [truth table](https://en.wikipedia.org/wiki/Truth_table) above, the encoding process involves subtracting the [Most Significant Bit (MSB)](https://en.wikipedia.org/wiki/Bit_numbering) of each pair of bits from the pair itself.
Since this method ensures that there are no zeros beyond the [leading zeros](https://en.wikipedia.org/wiki/Leading_zero), the final result for the number of [leading zeros](https://en.wikipedia.org/wiki/Leading_zero) is calculated as 32 minus the [population count](https://en.wikipedia.org/wiki/Hamming_weight).
```asm=
main:
li a0, 0x0
jal ra, my_clz
li a7, 10
ecall
my_clz:
my_clz_prologue:
add t0, x0, a0 # t0 = x
my_clz_padding:
srli t1, t0, 1 # t1 = x >> 1
or t0, t0, t1 # t0 = _x = x | (x >> 1)
srli t1, t0, 2 # t1 = _x >> 2
or t0, t0, t1 # t0 = __x = _x | (_x >> 2)
srli t1, t0, 4 # t1 = __x >> 4
or t0, t0, t1 # t0 = ___x = __x | (__x >> 4)
srli t1, t0, 8 # t1 = ___x >> 8
or t0, t0, t1 # t0 = ____x = ___x | (___x >> 8)
srli t1, t0, 16 # t1 = ____x >> 16
or t0, t0, t1 # t0 = p = ____x | (____x >> 16)
my_clz_popcount:
srli t1, t0, 1 # t1 = p >> 1
li t2, 0x55555555 # t2 = mask1 = 0b0101_0101_0101_0101_0101_0101_0101_0101
and t1, t1, t2 # t1 = (p >> 1) & mask1
sub t0, t0, t1 # t0 = c = p - ((p >> 1) & mask1)
srli t1, t0, 2 # t1 = c >> 2
li t2, 0x33333333 # t2 = mask2 = 0b0011_0011_0011_0011_0011_0011_0011_0011
and t1, t1, t2 # t1 = (c >> 2) & mask2
and t2, t0, t2 # t2 = c & mask2
add t0, t1, t2 # t0 = (c >> 2) & mask2 + c & mask2
srli t1, t0, 4 # t1 = m >> 4
add t1, t1, t0 # t1 = (m >> 4) + m
li t2, 0x0F0F0F0F # t2 = mask3 = 0b0000_1111_0000_1111_0000_1111_0000_1111
and t0, t1, t2 # t0 = e = (((m >> 4) + m) & mask3)
srli t1, t0, 8 # t1 = e >> 8
add t0, t0, t1 # t0 = w = e + (e >> 8)
srli t1, t0, 16 # t1 = w >> 16
add t0, t0, t1 # t0 = v = w + (w >> 16)
li t2, 0x3F # t2 = mask4 = 0b0000_0000_0000_0000_0000_0000_0011_1111
and t0, t0, t2 # t0 = v & mask4
li t1, 32 # t1 = 32
sub a0, t1, t0 # a0 = 32 - (v & mask4)
my_clz_epilogue:
jr ra # return control
```
The table below presents the [performance benchmarks](https://en.wikipedia.org/wiki/Benchmark_(computing)) for the [optimized](https://en.wikipedia.org/wiki/Program_optimization) `clz` algorithm. While the [best-case](https://www.geeksforgeeks.org/worst-average-and-best-case-analysis-of-algorithms/) (`0xFFFFFFFF`) performance slightly declined, most other cases, including `0x0` and `0xFFFF`, showed significant improvement. Notably, the original [worst-case](https://www.geeksforgeeks.org/worst-average-and-best-case-analysis-of-algorithms/) scenario improved from 277 [clock cycles](https://en.wikipedia.org/wiki/Clock_rate) to just 50, representing an ==82% speedup== compared to the previous version and an ==85% speedup== compared to the naive, direct translation.
<center>
|Test Data|Clock Cycles #|Instructions #|CPI|IPC|Used GPR #|
|:-:|:-:|:-:|:-:|:-:|:-:|
|0x0|==50==|40|1.25|1.25|5|
|0xFFFF|51|41|1.24|0.804|5|
|0xFFFFFFFF|50|40|1.25|0.8|5|
</center>

### 2.3 `fp16_to_fp32()` - Converting Half to Single-Precision Floating Point Numbers
The function `fp16_to_fp32` converts a 16-bit [half-precision floating-point number (fp16)](https://en.wikipedia.org/wiki/Half-precision_floating-point_format) to a 32-bit [single-precision floating-point number (fp32)](https://en.wikipedia.org/wiki/Single-precision_floating-point_format). It first extends the 16-bit number to 32 bits, separating the [sign, exponent, and mantissa](https://en.wikipedia.org/wiki/IEEE_754).
The [sign](https://en.wikipedia.org/wiki/Sign_bit) is isolated and placed in the [most significant bit](https://en.wikipedia.org/wiki/Bit_numbering), while the [mantissa and exponent](https://en.wikipedia.org/wiki/IEEE_754) are normalized based on their values. Special cases such as [NaN](https://en.wikipedia.org/wiki/NaN), infinity, and zero are handled using specific [masks](https://en.wikipedia.org/wiki/Mask_(computing)), ensuring proper conversion by adjusting the [exponent bias](https://en.wikipedia.org/wiki/IEEE_754) and handling denormalized numbers through [bit shifting](https://www.interviewcake.com/concept/java/bit-shift). Finally, the components are combined to form the [fp32](https://en.wikipedia.org/wiki/Single-precision_floating-point_format) result.
```asm=
fp16_to_fp32:
fp16_to_fp32_prologue:
addi sp, sp, -28
sw ra, 0(sp)
sw s0, 4(sp)
sw s1, 8(sp)
sw s2, 12(sp)
sw s3, 16(sp)
sw s4, 20(sp)
sw s5, 24(sp)
fp16_to_fp32_prologue_after:
slli s0, a0, 16 # s0 = w
li s1, 0x80000000 # s1 = sign mask = 0x80000000
and s1, s1, s0 # s1 = sign = w & sign_mask(0x80000000)
li s2, 0x7FFFFFFF # s2 = non_sign mask = 0x7FFFFFFF
and s2, s2, s0 # s2 = no_sign = w & non_sign_mask(0x7FFFFFFF)
mv a0, s2 # a0 = s2 = no_sign
jal ra, my_clz # a0 = renorm_shift = number of leading zeros
li s3, 0 # s3 = renorm_shift = 0
li t0, 5 # t0 = 5
slt t0, t0, a0 # t0 = 5 < renorm_shift
beq t0, x0, fp16_to_fp32_post_overflow_check # branch if 5 < renorm_shift
addi s3, a0, -5 # s3 = renorm_shift = renorm_shift - 5
fp16_to_fp32_post_overflow_check:
li s4, 0x04000000 # s4 = 0x04000000
add s4, s2, s4 # s4 = no_sign + 0x04000000
srli s4, s4, 8 # s4 = (no_sign + 0x04000000) >> 8 # buggy!!!
li t0, 0x7F800000 # t0 = 0x7F800000
and s4, s4, t0 # s4 = inf_nan_mask = ((no_sign + 0x04000000) >> 8) & 0x7F800000
addi s5, s2, -1 # s5 = no_sign - 1
srli s5, s5, 31 # s5 = zero_mask = (no_sign - 1) >> 31
sll t0, s2, s3 # t0 = no_sign << renorm_shift
srli t0, t0, 3 # t0 = (no_sign << renorm_shift) >> 3
li t1, 0x70 # t1 = 0x70
sub t1, t1, s3 # t1 = 0x70 - renorm_shift
slli t1, t1, 23 # t1 = (0x70 - renorm_shift) << 23
add t0, t0, t1 # t0 = (no_sign << renorm_shift) >> 3 + (0x70 - renorm_shift) << 23
or t0, t0, s4 # t0 = t0 | inf_nan_mask
not t1, s5 # t1 = ~zero_mask
and t0, t0, t1 # t0 = t0 & ~zero_mask
or a0, s1, t0 # a0 = sign | t0
fp16_to_fp32_epilogue:
lw ra, 0(sp)
lw s0, 4(sp)
lw s1, 8(sp)
lw s2, 12(sp)
lw s3, 16(sp)
lw s4, 20(sp)
lw s5, 24(sp)
addi sp, sp, 28
jr ra
```
The program above is a direct translation of the C code into [RISC-V assembly](https://en.wikipedia.org/wiki/RISC-V). Although it appears correct at first glance, there is a significant [bug](https://en.wikipedia.org/wiki/Software_bug) in the code. In the original C code, the expression `(int32_t)(nonsign + 0x04000000) >> 8` casts an [unsigned number](https://en.wikipedia.org/wiki/Signed_number_representations) to a [signed integer](https://en.wikipedia.org/wiki/Signed_number_representations) before performing a [right shift](https://en.wikipedia.org/wiki/Bitwise_operation). Since the result could be negative, [sign extension](https://en.wikipedia.org/wiki/Sign_extension) is mistakenly omitted in the assembly translation.
<center>
|Instruction|Output Hexadecimal Literal|
|:-:|:-:|
|**srli**|0x00830100|
|**srai**|0xFF830100|
</center>
This issue can be easily resolved by replacing the `srli` instruction with `srai`, which stands for =="shift right **arithmetic** immediate"== and ensures proper [sign extension](https://en.wikipedia.org/wiki/Sign_extension). The corrected code is shown below.
```asm=
main:
li a0, 0xFFFFFFFF
jal ra, fp16_to_fp32
li a7, 10
ecall
fp16_to_fp32:
fp16_to_fp32_prologue:
addi sp, sp, -28
sw ra, 0(sp)
sw s0, 4(sp)
sw s1, 8(sp)
sw s2, 12(sp)
sw s3, 16(sp)
sw s4, 20(sp)
sw s5, 24(sp)
fp16_to_fp32_prologue_after:
slli s0, a0, 16 # s0 = w
li s1, 0x80000000 # s1 = sign mask = 0x80000000
and s1, s1, s0 # s1 = sign = w & sign_mask(0x80000000)
li s2, 0x7FFFFFFF # s2 = non_sign mask = 0x7FFFFFFF
and s2, s2, s0 # s2 = no_sign = w & non_sign_mask(0x7FFFFFFF)
mv a0, s2 # a0 = s2 = no_sign
jal ra, my_clz # a0 = renorm_shift = number of leading zeros
li s3, 0 # s3 = renorm_shift = 0
li t0, 5 # t0 = 5
slt t0, t0, a0 # t0 = 5 < renorm_shift
beq t0, x0, fp16_to_fp32_post_overflow_check # branch if 5 < renorm_shift
addi s3, a0, -5 # s3 = renorm_shift = renorm_shift - 5
fp16_to_fp32_post_overflow_check:
li s4, 0x04000000 # s4 = 0x04000000
add s4, s2, s4 # s4 = no_sign + 0x04000000
srai s4, s4, 8 # s4 = (no_sign + 0x04000000) >> 8
li t0, 0x7F800000 # t0 = 0x7F800000
and s4, s4, t0 # s4 = inf_nan_mask = ((no_sign + 0x04000000) >> 8) & 0x7F800000
addi s5, s2, -1 # s5 = no_sign - 1
srli s5, s5, 31 # s5 = zero_mask = (no_sign - 1) >> 31
sll t0, s2, s3 # t0 = no_sign << renorm_shift
srli t0, t0, 3 # t0 = (no_sign << renorm_shift) >> 3
li t1, 0x70 # t1 = 0x70
sub t1, t1, s3 # t1 = 0x70 - renorm_shift
slli t1, t1, 23 # t1 = (0x70 - renorm_shift) << 23
add t0, t0, t1 # t0 = (no_sign << renorm_shift) >> 3 + (0x70 - renorm_shift) << 23
or t0, t0, s4 # t0 = t0 | inf_nan_mask
not t1, s5 # t1 = ~zero_mask
and t0, t0, t1 # t0 = t0 & ~zero_mask
or a0, s1, t0 # a0 = sign | t0
fp16_to_fp32_epilogue:
lw ra, 0(sp)
lw s0, 4(sp)
lw s1, 8(sp)
lw s2, 12(sp)
lw s3, 16(sp)
lw s4, 20(sp)
lw s5, 24(sp)
addi sp, sp, 28
jr ra
```
The subsequent table depicts the performance of this algorithm based on various metrics.
<center>
|Clock Cycles #|Instructions #|CPI|IPC|Used GPR #|
|:-:|:-:|:-:|:-:|:-:|
|104|87|1.2|0.837|10|
</center>
From the code above, we can see that the `s0` register is only used in the first few [instructions](https://en.wikipedia.org/wiki/Assembly_language), suggesting that it can be reused later. Additionally, we can utilize a [temporary register](https://en.wikipedia.org/wiki/Processor_register) `t0` to store the [hexadecimal literal](https://en.wikipedia.org/wiki/Hexadecimal) `0x7FFFFFFF`, which can be discarded afterward. Since `s2` is no longer needed, we can rename registers: `s3` to `s2`, `s4` to `s3`, and `s5` to `s4`. Consequently, we can eliminate the need to push and pop the `s5` register to and from the [stack](https://en.wikipedia.org/wiki/Stack_(abstract_data_type)). The [optimized code](https://en.wikipedia.org/wiki/Program_optimization) is shown below.
```asm=
fp16_to_fp32:
fp16_to_fp32_prologue:
addi sp, sp, -24
sw ra, 0(sp)
sw s0, 4(sp)
sw s1, 8(sp)
sw s2, 12(sp)
sw s3, 16(sp)
sw s4, 20(sp)
fp16_to_fp32_prologue_after:
slli s0, a0, 16 # s0 = w
li s1, 0x80000000 # s1 = sign mask = 0x80000000
and s1, s1, s0 # s1 = sign = w & sign_mask(0x80000000)
li t0, 0x7FFFFFFF # t0 = non_sign mask = 0x7FFFFFFF
and s0, t0, s0 # s0 = no_sign = w & non_sign_mask(0x7FFFFFFF)
mv a0, s0 # a0 = s0 = no_sign
jal ra, my_clz # a0 = renorm_shift = number of leading zeros
li s2, 0 # s2 = renorm_shift = 0
li t0, 5 # t0 = 5
slt t0, t0, a0 # t0 = 5 < renorm_shift
beq t0, x0, fp16_to_fp32_post_overflow_check # branch if 5 < renorm_shift
addi s2, a0, -5 # s3 = renorm_shift = renorm_shift - 5
fp16_to_fp32_post_overflow_check:
li s3, 0x04000000 # s3 = 0x04000000
add s3, s0, s3 # s3 = no_sign + 0x04000000
srai s3, s3, 8 # s3 = (no_sign + 0x04000000) >> 8
li t0, 0x7F800000 # t0 = 0x7F800000
and s3, s3, t0 # s3 = inf_nan_mask = ((no_sign + 0x04000000) >> 8) & 0x7F800000
addi s4, s0, -1 # s4 = no_sign - 1
srli s4, s4, 31 # s4 = zero_mask = (no_sign - 1) >> 31
sll t0, s0, s2 # t0 = no_sign << renorm_shift
srli t0, t0, 3 # t0 = (no_sign << renorm_shift) >> 3
li t1, 0x70 # t1 = 0x70
sub t1, t1, s2 # t1 = 0x70 - renorm_shift
slli t1, t1, 23 # t1 = (0x70 - renorm_shift) << 23
add t0, t0, t1 # t0 = (no_sign << renorm_shift) >> 3 + (0x70 - renorm_shift) << 23
or t0, t0, s3 # t0 = t0 | inf_nan_mask
not t1, s4 # t1 = ~zero_mask
and t0, t0, t1 # t0 = t0 & ~zero_mask
or a0, s1, t0 # a0 = sign | t0
fp16_to_fp32_epilogue:
lw ra, 0(sp)
lw s0, 4(sp)
lw s1, 8(sp)
lw s2, 12(sp)
lw s3, 16(sp)
lw s4, 20(sp)
addi sp, sp, 24
jr ra
```
The [optimized code](https://en.wikipedia.org/wiki/Program_optimization) takes 102 [clock cycles](https://en.wikipedia.org/wiki/Clock_rate) to finish, which is 2 [clock cycles](https://en.wikipedia.org/wiki/Clock_rate) faster than the original code.
<center>
|Clock Cycles #|Instructions #|CPI|IPC|Used GPR #|
|:-:|:-:|:-:|:-:|:-:|
|==102==|==85==|1.2|0.833|9|
</center>

---
## 3. LeetCode 977 - Squares of A Sorted Array
### 3.1 Problem Description
Given an integer array nums sorted in non-decreasing order, return an array of the squares of each number sorted in non-decreasing order.
### 3.2 C Program Implementation
The most straightforward solution to this problem is to iterate through the array, square each number, and then apply a [sorting algorithm](https://en.wikipedia.org/wiki/Sorting_algorithm) to obtain the result in non-decreasing order. However, this approach is suboptimal because it involves sorting the array, which, when using [merge sort](https://en.wikipedia.org/wiki/Merge_sort), results in a [time complexity](https://en.wikipedia.org/wiki/Time_complexity) of $O(nlogn)$.
A more efficient approach is to locate the separation point between negative and non-negative numbers and use three index counters: one to move left and two to move right. We then compare the squared values and store the larger one in the result array. The following is the C implementation.
```c=
int min(int a, int b) {
return a < b ? a : b;
}
int* sortedSquares(int* nums, int numsSize, int* returnSize) {
int* result = (int*) malloc(sizeof(int) * numsSize);
int splitPoint = numsSize;
for (int i = 0; i < numsSize; ++i) {
if (nums[i] >= 0) {
splitPoint = i;
break;
}
}
int res = 0, left = splitPoint - 1, right = splitPoint;
while (left >= 0 && right < numsSize) {
if (-nums[left] < nums[right])
result[res++] = nums[left] * nums[left--];
else
result[res++] = nums[right] * nums[right++];
}
while (left >= 0)
result[res++] = nums[left] * nums[left--];
while (right < numsSize)
result[res++] = nums[right] * nums[right++];
*returnSize = numsSize;
return result;
}
```
### 3.3 RISC-V Implementation
In the C program, memory was allocated on the [heap](https://en.wikipedia.org/wiki/Heap_(data_structure)) using `malloc`, with the assumption that the caller would remember to free the [allocated memory](https://en.wikipedia.org/wiki/Memory_management). However, in the [RISC-V program](https://en.wikipedia.org/wiki/RISC-V), [allocating memory](https://en.wikipedia.org/wiki/Memory_management) on the [heap](https://en.wikipedia.org/wiki/Heap_(data_structure)) is more complex because it requires making [OS calls](https://en.wikipedia.org/wiki/System_call), which are not supported in Ripes. As a result, I opted to store the result in memory allocated in the data section, which resides on the [stack](https://en.wikipedia.org/wiki/Stack_(abstract_data_type)). The following is the [RISC-V](https://en.wikipedia.org/wiki/RISC-V) code implementation with automated error checking.
```asm=
.data
#array: .word -4, -1, 0, 3, 10
#array: .word -7, -3, 2, 3, 11
array: .word -1, 3, 8, 10, 22
array_size: .word 5
output_array: .word 0, 0, 0, 0, 0
#output_answer: .word 0, 1, 9, 16, 100
#output_answer: .word 4, 9, 9, 49, 121
output_answer: .word 1, 9, 64, 100, 484
string_match_message: .asciz "The answer is correct."
string_mismatch_message: .asciz "The answer is incorrect."
.text
main:
la a0, array
la a1, array_size
lw a1, 0(a1)
la a2, output_array
jal ra, sorted_squares
la a0, output_array
la a1, output_answer
lw a2, array_size
jal ra, check_result
bne a0, x0, answer_match
answer_mismatch:
la a0, string_mismatch_message
li a7, 4
ecall
j answer_check_end
answer_match:
la a0, string_match_message
li a7, 4
ecall
answer_check_end:
li a7, 10
ecall
check_result:
mv t0, x0 # t0 = i = 0
check_result_loop:
slt t1, t0, a2 # t1 = i < n
beq t1, x0, check_result_loop_end # terminate loop if i >= n
slli t1, t0, 2 # t1 = i << 2
add t2, a0, t1 # t2 = &nums + (i << 2)
add t3, a1, t1 # t3 = &result + (i << 2)
lw t2, 0(t2) # t2 = nums[i]
lw t3, 0(t3) # t3 = result[i]
beq t2, t3, continue_loop
mv a0, x0
jr ra
continue_loop:
addi t0, t0, 1 # t0 = t0 + 1; i = i + 1
j check_result_loop
check_result_loop_end:
addi a0, x0, 1
jr ra
sorted_squares:
mv t0, x0 # t0 = 4i = 0
mv s0, a1 # s0 = splitPoint = numsSize
slli t2, a1, 2 # t2 = 4n = 4 * numsSize
loop_for_split_point:
bge t0, t2, loop_for_split_point_end # terminate loop if 4i >= 4n
add t1, a0, t0 # t2 = &nums + 4i
lw t1, 0(t1) # t2 = nums[i]
bge t1, x0, after_split_point # terminate loop if nums[0] >= 0
addi t0, t0, 4 # t0 = 4i + 4
j loop_for_split_point # repeat loop
after_split_point:
srli s0, t0, 2 # s0 = 4i / 4 = i
loop_for_split_point_end: # s0 = right
addi s1, s0, -1 # s1 = left = splitPoint - 1
mv s2, x0 # s2 = resIdx = 0
bidirectional_loop:
blt s1, x0, bidirectional_loop_end # terminate loop if left < 0
slt t1, s0, a1 # t1 = right < numsSize
beq t1, x0, bidirectional_loop_end # terminate loop if right >= numsSize
slli t0, s1, 2 # t0 = left << 2
add t0, a0, t0 # t0 = &nums + (left << 2)
lw t0, 0(t0) # t0 = nums[left]
sub t0, x0, t0 # t0 = -nums[left]
slli t1, s0, 2 # t1 = right << 2
add t1, a0, t1 # t1 = &nums + (right << 2)
lw t1, 0(t1) # t1 = nums[right]
blt t0, t1, select_right
mul t2, t1, t1 # t2 = nums[right] * nums[right]
addi s0, s0, 1 # s0 = s0 + 1; right = right + 1
j after_select
select_right:
mul t2, t0, t0 # t2 = -nums[left] * -nums[left]
addi s1, s1, -1 # s1 = s1 - 1; left = left - 1
after_select:
slli t0, s2, 2 # t0 = resIdx << 2
add t0, a2, t0 # t0 = &result + (resIdx << 2)
sw t2, 0(t0) # store t2 to result[resIdx]
addi s2, s2, 1 # s2 = s2 + 1; resIdx = resIdx + 1
j bidirectional_loop
bidirectional_loop_end:
blt s1, x0, left_directional_end
slli t0, s1, 2 # t0 = left << 2
add t0, a0, t0 # t0 = &nums + (left << 2)
lw t0, 0(t0) # t0 = nums[left]
mul t0, t0, t0 # t0 = nums[left] * nums[left]
slli t1, s2, 2 # t1 = resIdx << 2
add t1, a2, t1 # t1 = &result + (resIdx << 2)
sw t0, 0(t1) # store t0 to result[resIdx]
addi s1, s1, -1 # s1 = s1 - 1; left = left - 1
addi s2, s2, 1 # s2 = s2 + 1; resIdx = resIdx + 1
j bidirectional_loop_end
left_directional_end:
slt t0, s0, a1 # t0 = right < numsSize
beq t0, x0, right_directional_end # terminate loop if right >= numsSize
slli t0, s0, 2 # t0 = right << 2
add t0, a0, t0 # t0 = &nums + (right << 2)
lw t0, 0(t0) # t0 = nums[right]
slli t1, s2, 2 # t1 = resIdx << 2
add t1, a2, t1 # t1 = &result + (resIdx << 2)
mul t0, t0, t0 # t0 = t0 * t0 = nums[right] * nums[right]
sw t0, 0(t1) # store t0 to result[resIdx]
addi s0, s0, 1 # s0 = s0 + 1; right = right + 1
addi s2, s2, 1 # s2 = s2 + 1; resIdx = resIdx + 1
j left_directional_end
right_directional_end:
jr ra
```
### 3.4 RISC-V Implementation with CLZ Optimization
In the above code, since we are storing the squared values in an independent array, it is advantageous to dynamically determine the number of [bytes](https://en.wikipedia.org/wiki/Byte) needed for storage. This can be achieved by calculating the smallest [CLZ (Count Leading Zeros)](https://en.wikipedia.org/wiki/Leading_zero) values within the array. Since the squared numbers are always positive, we can [optimize](https://en.wikipedia.org/wiki/Program_optimization) storage by using a [byte](https://en.wikipedia.org/wiki/Byte) if the number is greater than 24 bits (ensuring the [MSB](https://en.wikipedia.org/wiki/Bit_numbering) remains 0 for positive values), a [half-word](https://en.wikipedia.org/wiki/Word_(computer_architecture)) if it exceeds 16 bits, and a [full word](https://en.wikipedia.org/wiki/Word_(computer_architecture)) if it is less than or equal to 16 bits.
```asm=
.data
#array: .word -4, -1, 0, 3, 10
#array: .word -7, -3, 2, 3, 11
array: .word -1, 3, 8, 10, 22
array_size: .word 5
output_array: .word 0, 0, 0, 0, 0
#output_answer: .word 0, 1, 9, 16, 100
#output_answer: .word 4, 9, 9, 49, 121
output_answer: .word 1, 9, 64, 100, 484
string_match_message: .asciz "The answer is correct."
string_mismatch_message: .asciz "The answer is incorrect."
output_array_final: .word 0, 0, 0, 0, 0
.text
main:
la a0, array
la a1, array_size
lw a1, 0(a1)
la a2, output_array
jal ra, sorted_squares
la a0, output_array
la a1, output_answer
lw a2, array_size
jal ra, check_result
bne a0, x0, answer_match
answer_mismatch:
la a0, string_mismatch_message
li a7, 4
ecall
j answer_check_end
answer_match:
la a0, string_match_message
li a7, 4
ecall
answer_check_end:
la a0, output_array
lw a1, array_size
jal compute_bytes_needed
mv a3, a0
la a0, output_array
lw a1, array_size
la a2, output_array_final
jal move_data
li a7, 10
ecall
check_result:
mv t0, x0 # t0 = i = 0
check_result_loop:
slt t1, t0, a2 # t1 = i < n
beq t1, x0, check_result_loop_end # terminate loop if i >= n
slli t1, t0, 2 # t1 = i << 2
add t2, a0, t1 # t2 = &nums + (i << 2)
add t3, a1, t1 # t3 = &result + (i << 2)
lw t2, 0(t2) # t2 = nums[i]
lw t3, 0(t3) # t3 = result[i]
beq t2, t3, continue_loop
mv a0, x0
jr ra
continue_loop:
addi t0, t0, 1 # t0 = t0 + 1; i = i + 1
j check_result_loop
check_result_loop_end:
addi a0, x0, 1
jr ra
my_clz:
my_clz_prologue:
add t0, x0, a0 # t0 = x
my_clz_padding:
srli t1, t0, 1 # t1 = x >> 1
or t0, t0, t1 # t0 = _x = x | (x >> 1)
srli t1, t0, 2 # t1 = _x >> 2
or t0, t0, t1 # t0 = __x = _x | (_x >> 2)
srli t1, t0, 4 # t1 = __x >> 4
or t0, t0, t1 # t0 = ___x = __x | (__x >> 4)
srli t1, t0, 8 # t1 = ___x >> 8
or t0, t0, t1 # t0 = ____x = ___x | (___x >> 8)
srli t1, t0, 16 # t1 = ____x >> 16
or t0, t0, t1 # t0 = p = ____x | (____x >> 16)
my_clz_popcount:
srli t1, t0, 1 # t1 = p >> 1
li t2, 0x55555555 # t2 = mask1 = 0b0101_0101_0101_0101_0101_0101_0101_0101
and t1, t1, t2 # t1 = (p >> 1) & mask1
sub t0, t0, t1 # t0 = c = p - ((p >> 1) & mask1)
srli t1, t0, 2 # t1 = c >> 2
li t2, 0x33333333 # t2 = mask2 = 0b0011_0011_0011_0011_0011_0011_0011_0011
and t1, t1, t2 # t1 = (c >> 2) & mask2
and t2, t0, t2 # t2 = c & mask2
add t0, t1, t2 # t0 = (c >> 2) & mask2 + c & mask2
srli t1, t0, 4 # t1 = m >> 4
add t1, t1, t0 # t1 = (m >> 4) + m
li t2, 0x0F0F0F0F # t2 = mask3 = 0b0000_1111_0000_1111_0000_1111_0000_1111
and t0, t1, t2 # t0 = e = (((m >> 4) + m) & mask3)
srli t1, t0, 8 # t1 = e >> 8
add t0, t0, t1 # t0 = w = e + (e >> 8)
srli t1, t0, 16 # t1 = w >> 16
add t0, t0, t1 # t0 = v = w + (w >> 16)
li t2, 0x3F # t2 = mask4 = 0b0000_0000_0000_0000_0000_0000_0011_1111
and t0, t0, t2 # t0 = v & mask4
li t1, 32 # t1 = 32
sub a0, t1, t0 # a0 = 32 - (v & mask4)
my_clz_epilogue:
jr ra # return control
sorted_squares:
mv t0, x0 # t0 = 4i = 0
mv s0, a1 # s0 = splitPoint = numsSize
slli t2, a1, 2 # t2 = 4n = 4 * numsSize
loop_for_split_point:
bge t0, t2, loop_for_split_point_end # terminate loop if 4i >= 4n
add t1, a0, t0 # t2 = &nums + 4i
lw t1, 0(t1) # t2 = nums[i]
bge t1, x0, after_split_point # terminate loop if nums[0] >= 0
addi t0, t0, 4 # t0 = 4i + 4
j loop_for_split_point # repeat loop
after_split_point:
srli s0, t0, 2 # s0 = 4i / 4 = i
loop_for_split_point_end: # s0 = right
addi s1, s0, -1 # s1 = left = splitPoint - 1
mv s2, x0 # s2 = resIdx = 0
bidirectional_loop:
blt s1, x0, bidirectional_loop_end # terminate loop if left < 0
slt t1, s0, a1 # t1 = right < numsSize
beq t1, x0, bidirectional_loop_end # terminate loop if right >= numsSize
slli t0, s1, 2 # t0 = left << 2
add t0, a0, t0 # t0 = &nums + (left << 2)
lw t0, 0(t0) # t0 = nums[left]
sub t0, x0, t0 # t0 = -nums[left]
slli t1, s0, 2 # t1 = right << 2
add t1, a0, t1 # t1 = &nums + (right << 2)
lw t1, 0(t1) # t1 = nums[right]
blt t0, t1, select_right
mul t2, t1, t1 # t2 = nums[right] * nums[right]
addi s0, s0, 1 # s0 = s0 + 1; right = right + 1
j after_select
select_right:
mul t2, t0, t0 # t2 = -nums[left] * -nums[left]
addi s1, s1, -1 # s1 = s1 - 1; left = left - 1
after_select:
slli t0, s2, 2 # t0 = resIdx << 2
add t0, a2, t0 # t0 = &result + (resIdx << 2)
sw t2, 0(t0) # store t2 to result[resIdx]
addi s2, s2, 1 # s2 = s2 + 1; resIdx = resIdx + 1
j bidirectional_loop
bidirectional_loop_end:
blt s1, x0, left_directional_end
slli t0, s1, 2 # t0 = left << 2
add t0, a0, t0 # t0 = &nums + (left << 2)
lw t0, 0(t0) # t0 = nums[left]
mul t0, t0, t0 # t0 = nums[left] * nums[left]
slli t1, s2, 2 # t1 = resIdx << 2
add t1, a2, t1 # t1 = &result + (resIdx << 2)
sw t0, 0(t1) # store t0 to result[resIdx]
addi s1, s1, -1 # s1 = s1 - 1; left = left - 1
addi s2, s2, 1 # s2 = s2 + 1; resIdx = resIdx + 1
j bidirectional_loop_end
left_directional_end:
slt t0, s0, a1 # t0 = right < numsSize
beq t0, x0, right_directional_end # terminate loop if right >= numsSize
slli t0, s0, 2 # t0 = right << 2
add t0, a0, t0 # t0 = &nums + (right << 2)
lw t0, 0(t0) # t0 = nums[right]
slli t1, s2, 2 # t1 = resIdx << 2
add t1, a2, t1 # t1 = &result + (resIdx << 2)
mul t0, t0, t0 # t0 = t0 * t0 = nums[right] * nums[right]
sw t0, 0(t1) # store t0 to result[resIdx]
addi s0, s0, 1 # s0 = s0 + 1; right = right + 1
addi s2, s2, 1 # s2 = s2 + 1; resIdx = resIdx + 1
j left_directional_end
right_directional_end:
jr ra
compute_bytes_needed:
compute_bytes_prologue:
addi sp, sp, -16 # update stack position to allocate 4 words
sw s0, 0(sp) # push s0 to stack
sw s1, 4(sp) # push s1 to stack
sw s2, 8(sp) # push s2 to stack
sw ra, 12(sp) # push ra to stack
compute_bytes_prologue_end:
mv s0, a0 # s0 = int_list
addi s1, a1, -1 # s1 = idx = n - 1
slli s1, s1, 2 # s1 = 4 * idx = 4 * (n - 1)
li s2, 32 # s2 = min_leading_zeros
compute_bytes_loop:
blt s1, x0, compute_bytes_epilogue # terminate loop if (idx * 4) < 0
add t0, s0, s1 # t0 = &int_list + (idx * 4)
lw a0, 0(t0) # a0 = int_list[idx]
jal my_clz # a0 = num_leading_zeros
blt s2, a0, skip_update # skip updating s2 if s2 < a0
mv s2, a0 # s2 = a0 = smaller min_leading_zeros
skip_update:
addi s1, s1, -4 # s1 = s1 - 4; 4 * idx = 4 * idx - 4
j compute_bytes_loop # continue loop
compute_bytes_epilogue:
li t0, 32 # t0 = 32 (constant)
sub a0, t0, s2 # a0 = number of bits needed = 32 - min_leading_zeros
lw s0, 0(sp) # pop s0 from stack
lw s1, 4(sp) # pop s1 from stack
lw s2, 8(sp) # pop s2 from stack
lw ra, 12(sp) # pop ra from stack
addi sp, sp, 16 # update stack position to deallocate 4 words
jr ra # return to caller
move_data:
# a0: num[]
# a1: num_size
# a2: res[]
# a3: num_bits
li t0, 0 # t0 = idx = 0
move_data_loop:
slt t1, t0, a1 # t1 = idx < n
beq t1, x0, loop_end # terminate loop if (idx >= n)
slli t1, t0, 2 # t1 = idx << 2
add t1, a0, t1 # t1 = &nums + (idx << 2)
lw t1, 0(t1) # t1 = nums[idx]
slti t2, a3, 8 # t2 = num_bits < 8
bne t2, x0, store_byte # store byte if (num_bits < 8)
slti t2, a3, 16 # t2 = num_bits < 16
bne t2, x0, store_half # store half if (num_bits < 16)
addi t2, t0, 2 # t2 = idx << 2
add t2, a2, t2 # t2 = res + idx << 2
sw t1, 0(t2) # store nums[idx] to res[idx]
j end_store
store_half:
slli t2, t0, 1 # t2 = idx << 1
add t2, a2, t2 # t2 = res + (idx << 1)
sh t1, 0(t2) # store nums[idx] to res[idx]
j end_store
store_byte:
add t2, a2, t0 # t2 = res + idx
sb t1, 0(t2) # store nums[idx] to res[idx]
end_store:
addi t0, t0, 1 # t0 = t0 + 1; idx = idx + 1
j move_data_loop
loop_end:
jr ra
```
For the test data `[-4, -1, 0, 3, 10]`, the output becomes `[0, 1, 9, 16, 100]`. The table below shows the number of [bytes](https://en.wikipedia.org/wiki/Byte) required to store the resulting array. As indicated, the minimum [CLZ](https://en.wikipedia.org/wiki/Leading_zero) score is 25, which exceeds 24, meaning a single [byte](https://en.wikipedia.org/wiki/Byte) per number suffices.
<center>
|Decimal|Hexadecimal|CLZ|Unit Needed|Data in Memory|
|:-:|:-:|:-:|:-:|:-:|
|0|0x0|32|Byte|0x00|
|1|0x1|31|Byte|0x01|
|9|0x9|28|Byte|0x09|
|16|0x10|27|Byte|0x10|
|100|0x64|25|Byte|0x64|
</center>
We can now examine the memory to confirm that the data is stored as [bytes](https://en.wikipedia.org/wiki/Byte). The screenshot below verifies that the array has been compressed using smaller units without any loss of information.

For the test data `[-1, 3, 8, 10, 22]`, the output becomes `[1, 9, 64, 100, 484]`. The table demonstrates that the minimum [CLZ](https://en.wikipedia.org/wiki/Leading_zero) is 23, which is less than 24, requiring a [half-word](https://en.wikipedia.org/wiki/Word_(computer_architecture)) unit to store each element of the array.
<center>
|Decimal|Hexadecimal|CLZ|Unit Needed|Data in Memory|
|:-:|:-:|:-:|:-:|:-:|
|1|0x1|31|Byte|0x0001|
|9|0x9|28|Byte|0x0009|
|64|0x40|28|Byte|0x0040|
|100|0x64|25|Byte|0x0064|
|==484==|==0x1E4==|==23==|==Half Word==|==0x01E4==|
</center>
The screenshot below confirms that the data is indeed stored as [half-words](https://en.wikipedia.org/wiki/Word_(computer_architecture)) in memory.
