contributed by < ranvd >
The sine function holds significant importance in both mathematics and computer graphics. In this project, the sine and cosine functions are employed to calculate the phase that determines where lines should be drawn.
Since the sine function is defined as the ratio of the side opposite an angle to the hypotenuse, its value always sits in [-1, 1].
By Taylor's theorem, a sufficiently smooth function can be approximated by its Taylor series. In computer science, we can use this technique to compute the sine value after reducing the argument to a small range around zero, where the series converges quickly.
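The truncated series used later in mySin keeps the first five terms:

```latex
\sin x = \sum_{n=0}^{\infty} \frac{(-1)^n}{(2n+1)!}\, x^{2n+1}
       \approx x - \frac{x^3}{3!} + \frac{x^5}{5!} - \frac{x^7}{7!} + \frac{x^9}{9!}
```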
fmul32
Since we do not support NaN, Inf, or denormalized numbers as input, we can still provide a "ready to use" floating-point multiplication. We return zero immediately if either a or b equals zero.
Furthermore, basic overflow detection can be carried out by comparing the sign bit of the result against the sign bits of the operands: the product's sign must equal the XOR of the operands' signs, so the MSB of r ^ ia ^ ib is set exactly when the exponent arithmetic overflowed into the sign bit. An arithmetic right shift by 31 turns this into the mask ovfl, which is -1 (all ones) if an overflow occurs and 0 otherwise; the result is then forced to Inf.
float fmul32(float a, float b)
{
    /* TODO: Special values like NaN and INF */
    int32_t ia = *(int32_t *) &a, ib = *(int32_t *) &b;
    if (ia == 0 || ib == 0)
        return 0;
    /* mantissa (with the implicit leading 1 restored) */
    int32_t ma = (ia & 0x7FFFFF) | 0x800000;
    int32_t mb = (ib & 0x7FFFFF) | 0x800000;
    /* sign and exponent */
    int32_t sea = ia & 0xFF800000;
    int32_t seb = ib & 0xFF800000;
    /* result of mantissa */
    int32_t m = imul24(ma, mb);
    int32_t mshift = getbit(m, 24); /* did the product carry past bit 23? */
    m >>= mshift;
    int32_t r = ((sea - 0x3f800000 + seb) & 0xFF800000) + m - (0x800000 & -!mshift);
    int32_t ovfl = (r ^ ia ^ ib) >> 31;
    r = r ^ ((r ^ 0x7f800000) & ovfl);
    return *(float *) &r;
}
imul32
In floating-point multiplication, only the upper bits of the 48-bit mantissa product are needed. To reflect this, I renamed the function from imul32
to imul24
in the following code. Instead of shifting the addend left as a classic shift-and-add multiply does, this approach shifts the partial result right on every iteration, so the upper bits of the product are the ones that are consistently preserved.
static int32_t imul24(int32_t a, int32_t b)
{
    uint32_t r = 0;
    for (; b; b >>= 1)
        r = (r >> 1) + (a & -getbit(b, 0));
    return r;
}
Since the input argument b
is always 24 bits long, I can easily apply the loop unrolling technique as below to reduce the execution cycles.
static int32_t imul24(int32_t a, int32_t b)
{
    uint32_t r = 0;
    for (; b; b >>= 4) {
        r = (r >> 1) + (a & -getbit(b, 0));
        r = (r >> 1) + (a & -getbit(b, 1));
        r = (r >> 1) + (a & -getbit(b, 2));
        r = (r >> 1) + (a & -getbit(b, 3));
    }
    return r;
}
fdiv32
In the fdiv32
function, NaN, Inf, and denormalized numbers are not allowed, just as in fmul32
. I still provide basic floating-point division using the techniques mentioned above.
float fdiv32(float a, float b)
{
    int32_t ia = *(int32_t *) &a, ib = *(int32_t *) &b;
    if (a == 0)
        return a;
    if (b == 0) /* division by zero: return +Inf */
        return *(float *) &(int){0x7f800000};
    /* mantissa */
    int32_t ma = (ia & 0x7FFFFF) | 0x800000;
    int32_t mb = (ib & 0x7FFFFF) | 0x800000;
    /* sign and exponent */
    int32_t sea = ia & 0xFF800000;
    int32_t seb = ib & 0xFF800000;
    /* result of mantissa */
    int32_t m = idiv24(ma, mb);
    int32_t mshift = !getbit(m, 31); /* quotient < 1: renormalize, exponent - 1 */
    m <<= mshift;
    int32_t r = ((sea - seb + 0x3f800000) - (0x800000 & -mshift)) | ((m & 0x7fffff00) >> 8);
    int32_t ovfl = (ia ^ ib ^ r) >> 31;
    r = r ^ ((r ^ 0x7f800000) & ovfl);
    return *(float *) &r;
}
idiv24
idiv24
can handle 32-bit integer division. The dividend and the quotient are updated by shifting 1 bit left each step. If the remaining dividend is greater than or equal to the divisor, the LSB of the quotient is set to 1. This is basically binary long division.
static int32_t idiv24(int32_t a, int32_t b)
{
    uint32_t r = 0;
    for (int i = 0; i < 32; i++) {
        r <<= 1;
        if (a - b < 0) {
            a <<= 1;
            continue;
        }
        r |= 1;
        a -= b;
        a <<= 1;
    }
    return r;
}
In the above code, we can see that a
always needs to be shifted left; the only difference is whether b
is subtracted from a or not. Working the other way around, we can subtract unconditionally and use b & -(a < 0)
to determine whether b
should be added back to a
or not, which is called restoring division.
static int32_t idiv24(int32_t a, int32_t b)
{
    uint32_t r = 0;
    for (int i = 0; i < 32; i++) {
        a -= b;
        r = (r << 1) | (a >= 0);
        a = (a + (b & -(a < 0))) << 1;
    }
    return r;
}
Also, since the loop count is fixed, the loop unrolling technique fits very well here. The unrolled version is omitted, as it can be produced by simple copy-and-paste.
fadd32
In floating-point addition, it is essential to align the smaller value with the greater one. The straightforward approach compares the exponents first and then the mantissas. However, thanks to the IEEE 754 format, both can be checked at once by comparing the bit patterns (with the sign bit masked off) directly, as cmp_a
and cmp_b
show.
After aligning the values, their sign bits must be compared: if the sign bits differ, the aligned mantissa is subtracted instead of added.
float fadd32(float a, float b)
{
    int32_t ia = *(int32_t *) &a, ib = *(int32_t *) &b;
    int32_t cmp_a = ia & 0x7fffffff;
    int32_t cmp_b = ib & 0x7fffffff;
    if (cmp_a < cmp_b)
        iswap(ia, ib); /* make ia the larger magnitude */
    /* exponent */
    int32_t ea = (ia >> 23) & 0xff;
    int32_t eb = (ib >> 23) & 0xff;
    /* mantissa */
    int32_t ma = (ia & 0x7fffff) | 0x800000;
    int32_t mb = (ib & 0x7fffff) | 0x800000;
    int32_t align = (ea - eb > 24) ? 24 : (ea - eb);
    mb >>= align;
    if ((ia ^ ib) >> 31)
        ma -= mb;
    else
        ma += mb;
    /* renormalize so the leading 1 sits at bit 23 */
    int32_t clz = count_leading_zeros(ma);
    int32_t shift = 0;
    if (clz <= 8) {
        shift = 8 - clz;
        ma >>= shift;
        ea += shift;
    } else {
        shift = clz - 8;
        ma <<= shift;
        ea -= shift;
    }
    int32_t r = (ia & 0x80000000) | (ea << 23) | (ma & 0x7fffff);
    return *(float *) &r;
}
f2i32 & i2f32
The functions i2f32
and f2i32
convert values between floating-point and integer representations. The "2" stands for "to", so i2f
reads as "int to float".
/* float -> int, truncating toward zero; the sign bit is ignored,
 * so only non-negative inputs are meaningful here. */
int f2i32(float x)
{
    int32_t a = *(int32_t *) &x;
    int32_t ma = (a & 0x7FFFFF) | 0x800000;
    int32_t ea = ((a >> 23) & 0xFF) - 127;
    if (ea < 0)
        return 0;
    else if (ea <= 23)
        ma >>= (23 - ea);
    else
        ma <<= (ea - 23);
    return ma;
}
float i2f32(int x)
{
    if (x == 0)
        return 0;
    int32_t s = x & 0x80000000;
    if (s)
        x = -x;
    int32_t clz = count_leading_zeros(x);
    int32_t e = 31 - clz + 127;
    if (clz <= 8)
        x >>= 8 - clz;
    else
        x <<= clz - 8;
    int32_t r = s | (e << 23) | (x & 0x7fffff);
    return *(float *) &r;
}
Sin
The sine value can be obtained using the Taylor series. The following code is the corresponding implementation:
float myPow(float x, int n)
{
    float r = 1.0;
    while (n) {
        if (n & 0x1) {
            r = fmul32(r, x);
            n -= 1;
        } else {
            x = fmul32(x, x);
            n >>= 1;
        }
    }
    return r;
}
// n!
float factorial(int n)
{
    float r = 1.0;
    for (int i = 1; i <= n; i++)
        r = fmul32(r, i2f32(i));
    return r;
}
// Sine by Taylor series
float mySin(float x)
{
    float r = 0.0;
    for (int n = 0; n < 5; n++) {
        int k = f2i32(fadd32(fmul32(i2f32(2), i2f32(n)), i2f32(1))); /* k = 2n + 1 */
        int s = 1 ^ ((-2) & -(n & 0x1));                             /* s = (-1)^n */
        r = fadd32(r, fdiv32(fmul32(i2f32(s), myPow(x, k)), factorial(k)));
    }
    return r;
}
[Ripes measurements with the -O0 option]
[Ripes measurements with the -O2 option]
[Ripes measurements with the -O3 option]
fmul32 & imul24
By inlining the logic of imul24
directly within fmul32
, we eliminate an extra function call, which improves performance and removes the store/load overhead of the calling convention. This inlining technique is commonly employed to enhance the efficiency of code execution.
fmul32:
# prologue
addi sp, sp, -4
sw ra, 0(sp)
seqz t0, a0
seqz t1, a1
or t0, t0, t1
beqz t0, 2f
li a0, 0
j 3f
2: li t2, 0x7FFFFF
and t0, a0, t2
and t1, a1, t2
addi t2, t2, 1
or t0, t0, t2 # t0 = ma
or t1, t1, t2 # t1 = mb
imul24:
li t3, 0 # t3 = m, r(in imul24)
1:
andi t4, t1, 0x1
neg t4, t4
and t4, t0, t4
srli t3, t3, 1
add t3, t3, t4
srli t1, t1, 1
bnez t1, 1b
mv t0, a0 # t0 = a
mv t1, a1 # t1 = b
mv a0, t3
li a1, 24
jal getbit # a0 = mshift
# m(t3) value computed
srl t3, t3, a0
li t2, 0xFF800000
and t0, t0, t2 # t0 = sea
and t1, t1, t2 # t1 = seb
li t2, 0x3f800000
sub t4, t0, t2
add t4, t4, t1
li t2, 0xFF800000
and t4, t4, t2 # t4 = ((sea - 0x3f800000 + seb) & 0xFF800000)
li t2, 0x7fffff
slli a0, a0, 23
or a0, a0, t2
and a0, a0, t3
add a0, a0, t4 # a0 = r(in fmul32)
# check overflow
xor t3, t0, t1
xor t3, t3, a0
srli t3, t3, 31 # t3 = ovfl
li t2, 0x7f800000
xor t4, t2, a0
and t4, t4, t3
xor a0, a0, t4
3: # epilogue
lw ra, 0(sp)
addi sp, sp, 4
ret
fdiv32 & idiv24
The same idea applies to fdiv32 and idiv24. We embed idiv24
into fdiv32
to prevent additional load/store overhead.
fdiv32:
# prologue
addi sp, sp, -4
sw ra, 0(sp)
beqz a0, 3f
bnez a1, 1f
li a0, 0x7f800000
j 3f
1:
li t2, 0x7FFFFF
and t0, t2, a0
and t1, t2, a1
addi t2, t2, 1
or t0, t0, t2 # t0 = ma
or t1, t1, t2 # t1 = mb
idiv24:
li t3, 0 # t3 = m, r(in idiv24)
li t4, 32 # t4 = end condition
2:
sub t0, t0, t1
sltz t2, t0
seqz t2, t2
slli t3, t3, 1
or t3, t3, t2
seqz t2, t2
neg t2, t2
and t5, t2, t1 # t5 = b & -(a < 0)
add t0, t0, t5
slli t0, t0, 1
addi t4, t4, -1
bnez t4, 2b
li t2, 0xFF800000
and t0, a0, t2 # t0 = sea
and t1, a1, t2 # t1 = seb
mv a0, t3
li a1, 31
jal getbit
seqz a0, a0 # a0 = mshift
sll t3, t3, a0 # t3 = m
li t2, 0x3f800000
sub t4, t0, t1
add t4, t4, t2 # t4 = sea - seb + 0x3f800000
neg a0, a0
li t2, 0x800000
and a0, a0, t2 # a0 = 0x800000 & -mshift
srli t3, t3, 8
addi t2, t2, -1
and t3, t3, t2 # t3 = m
sub a0, t4, a0
or a0, a0, t3
# check overflow
xor t3, t1, t0
xor t3, t3, a0
srli t3, t3, 31
li t2, 0x7f800000
xor t4, t2, a0
and t4, t4, t3
xor a0, a0, t4
3:
# epilogue
lw ra, 0(sp)
addi sp, sp, 4
ret
fadd32
fadd32:
# prologue
addi sp, sp, -4
sw ra, 0(sp)
li t2, 0x7fffffff
and t0, a0, t2 # t0 = cmp_a
and t1, a1, t2 # t1 = cmp_b
bge t0, t1, 1f
mv t2, a0 # swap a0 = ia, a1 = ib
mv a0, a1
mv a1, t2
1:
srli t0, a0, 23
srli t1, a1, 23
andi t0, t0, 0xff # t0 = ea
andi t1, t1, 0xff # t1 = eb
li t2, 0x7fffff
and t3, a0, t2
and t4, a1, t2
addi t2, t2, 1
or t3, t3, t2 # t3 = ma
or t4, t4, t2 # t4 = mb
sub t2, t0, t1 # t2 = align
li t5, 24
bge t5, t2, 2f
li t2, 24
2:
srl t4, t4, t2 # mb >>= align
xor t2, a0, a1
srli t2, t2, 31
beqz t2, 3f
neg t4, t4
3:
add t3, t3, t4 # t3 = result of ma
# t1(eb) and t4(mb) are free to use
mv t4, a0 # t4 = ia
mv a0, t3
jal count_leading_zeros # a0 = clz
li t2, 8
blt t2, a0, 4f
sub t5, t2, a0
srl t3, t3, t5
add t0, t0, t5
j 5f
4:
sub t5, a0, t2 # t5 = shift
sll t3, t3, t5
sub t0, t0, t5
5:
li t2, 0x80000000
and a0, t2, t4
slli t0, t0, 23
or a0, a0, t0
li t2, 0x7fffff
and t3, t3, t2
or a0, a0, t3
# epilogue
lw ra, 0(sp)
addi sp, sp, 4
ret
Upon analysis, it is clear that the CPI and the number of executed instructions (Instrs. Retired) land somewhere between the values achieved with the -O0
and -O2
options in the GCC compiler. This indicates there is ample opportunity for improvement.
In the selected Ripes CPU model, there is no branch prediction mechanism, and the branch target is computed in the 3rd stage. Consequently, a branch miss incurs a penalty of 2 cycles, which is notably high in this case.
Based on the given scenario, loop unrolling appears to be an ideal optimization technique in this case. I have applied the unrolling method to the imul24
and idiv24
functions, both of which are extensively used in this homework assignment.
imul24:
li t3, 0 # t3 = m, r(in imul24)
1:
andi t4, t1, 0x1
neg t4, t4
and t4, t0, t4
srli t3, t3, 1
add t3, t3, t4
srli t1, t1, 1
+ andi t4, t1, 0x1 # unroll 2
+ neg t4, t4
+ and t4, t0, t4
+ srli t3, t3, 1
+ add t3, t3, t4
+ srli t1, t1, 1
+ andi t4, t1, 0x1 # unroll 3
+ neg t4, t4
+ and t4, t0, t4
+ srli t3, t3, 1
+ add t3, t3, t4
+ srli t1, t1, 1
+ andi t4, t1, 0x1 # unroll 4
+ neg t4, t4
+ and t4, t0, t4
+ srli t3, t3, 1
+ add t3, t3, t4
+ srli t1, t1, 1
+ bnez t1, 1b
idiv24:
li t3, 0 # t3 = m, r(in idiv24)
li t4, 32 # t4 = end condition
2:
sub t0, t0, t1
sltz t2, t0
seqz t2, t2
slli t3, t3, 1
or t3, t3, t2
seqz t2, t2
neg t2, t2
and t5, t2, t1 # t5 = b & -(a < 0)
add t0, t0, t5
slli t0, t0, 1
+ sub t0, t0, t1
+ sltz t2, t0
+ seqz t2, t2
+ slli t3, t3, 1
+ or t3, t3, t2
+ seqz t2, t2
+ neg t2, t2
+ and t5, t2, t1 # t5 = b & -(a < 0)
+ add t0, t0, t5
+ slli t0, t0, 1
+ sub t0, t0, t1
+ sltz t2, t0
+ seqz t2, t2
+ slli t3, t3, 1
+ or t3, t3, t2
+ seqz t2, t2
+ neg t2, t2
+ and t5, t2, t1 # t5 = b & -(a < 0)
+ add t0, t0, t5
+ slli t0, t0, 1
+ sub t0, t0, t1
+ sltz t2, t0
+ seqz t2, t2
+ slli t3, t3, 1
+ or t3, t3, t2
+ seqz t2, t2
+ neg t2, t2
+ and t5, t2, t1 # t5 = b & -(a < 0)
+ add t0, t0, t5
+ slli t0, t0, 1
- addi t4, t4, -1
+ addi t4, t4, -4
+ bnez t4, 2b
Upon analysis, it is apparent that the CPI is lower than at the -O3
optimization level, yet the number of executed instructions (Instrs. Retired) is higher: we still execute about 2000 more instructions in this implementation. This implies there is still room for optimizing our assembly code; some redundant code can likely be eliminated for better efficiency.
The values differ slightly from sinf
in the C math library. The primary deviation arises from the rounding strategy employed: our implementation always truncates (rounds toward zero), whereas IEEE 754 defaults to the "round to nearest even" method.
However, these discrepancies are acceptable, since single-precision floating point only guarantees accuracy to about 6 to 7 significant digits.
| degree | mySin        | sinf (math.h) |
| ------ | ------------ | ------------- |
| 45.0   | 0.7071068287 | 0.7071067691  |
| 20.0   | 0.3420201540 | 0.3420201242  |
| 90.0   | 1.0000035763 | 1.0000000000  |