劉益祥
The CORDIC algorithm operates using iterative shift-add arithmetic and is based on vector rotation in a plane. By sequentially rotating a vector through predefined angles, the algorithm computes trigonometric functions like sine and cosine. It avoids the need for direct multiplication or division by using only addition, subtraction, bit shifts, and table lookups.
Trying to use uint32_t
with shift and add operations to replace float
with floating-point operations in cordic algorithm.
refrence from 2024 Quiz1 Problem A
refrence from 2024 Quiz1 Problem A
modified from 2023 Quiz1 Problem A
input1 | input2 | answer | myanswer | error rate |
---|---|---|---|---|
15.961235 | 43.132904 | 59.094139 | 59.094139 | 0.000000e+00 |
36.952332 | 17.209785 | 54.162117 | 54.162117 | 0.000000e+00 |
-22.274956 | 44.483910 | 22.208954 | 22.208954 | 0.000000e+00 |
-27.907221 | -45.500877 | -73.408096 | -73.408096 | 0.000000e+00 |
-32.052887 | 31.473969 | -0.578918 | -0.578918 | 0.000000e+00 |
5.428631 | 45.491280 | 50.919910 | 50.919910 | 0.000000e+00 |
34.613480 | -40.486797 | -5.873318 | -5.873318 | 0.000000e+00 |
37.345215 | 38.584961 | 75.930176 | 75.930176 | 0.000000e+00 |
-22.956657 | -22.030470 | -44.987129 | -44.987125 | 8.479530e-08 |
12.832539 | 9.602684 | 22.435223 | 22.435223 | 0.000000e+00 |
input1 | input2 | answer | myanswer | error rate |
---|---|---|---|---|
-2.152653 | -18.543297 | 39.917278 | 39.917278 | 0.000000e+00 |
29.927048 | -44.895565 | -1343.591675 | -1343.591675 | 0.000000e+00 |
19.867577 | 34.633919 | 688.092041 | 688.091980 | 8.870202e-08 |
-13.439705 | 9.803593 | -131.757385 | -131.757385 | 0.000000e+00 |
49.927086 | 24.040535 | 1200.273804 | 1200.273804 | 0.000000e+00 |
1.543598 | 47.617355 | 73.502060 | 73.502060 | 0.000000e+00 |
-43.442417 | -24.092621 | 1046.641724 | 1046.641602 | 1.166305e-07 |
-23.320192 | -19.306118 | 450.222382 | 450.222382 | 0.000000e+00 |
44.823502 | 24.745850 | 1109.195679 | 1109.195557 | 1.100530e-07 |
-5.549877 | 11.534317 | -64.014046 | -64.014038 | 1.191831e-07 |
Noticed that when the hardware does not support floating-point operations, the compiler will translate floating-point computations into a software-supported form, similar to what I did above but with better performance.
Due to poor performance and the discovery that the mado
project uses fixed-point arithmetic for trigonometric calculations, an alternative method is adopted.
focus on the trig.c
todo : Replace the division in twin_tan()
with another method that requires fewer cycles.
Try using jemalloc
instead of division operations in twin_tan()
.
Need to know the value of the divisor in advance. Can try precomputing the divisor using fastdiv_prepare()
, storing the resulting values of m
, s1
, and s2
. When needed, retrieve these precomputed values from a lookup table for fast computation. The issue is that this approach requires a significant amount of storage space. Even if the divisor precision is limited to two decimal places, it would still require 100 × 48 bits, exceeding 4 kilobits of storage space.
table_gen()
full code
Code to generate lookup table. The original divisor had a precision of 16 bits, but due to table size considerations, only 7 bits of precision are retained.
table be generated by table_gen()
twin_tan()
use lookup table to apply jemalloc's fast division approach in the twin_tan()
function.
twin_tan()
without lookup tableDirectly using jemalloc fast division to replace the division in twin_tan()
code used to test accuracy
full code
angle | answer | myanswer | fast_div answer | error rate | fast_div error rate |
---|---|---|---|---|---|
14659 | 35398 | 35282 | 35398 | 0.003277 | 0.000000 |
-23569 | 2513772 | 2096448 | 2513772 | 0.166015 | 0.000000 |
21825 | -122226 | -231048 | -122226 | 0.890334 | 0.000000 |
-23902 | 110156 | 112652 | 112652 | 0.022659 | 0.022659 |
20873 | 45094 | 44860 | 74302 | 0.005189 | 0.647714 |
-29065 | -45094 | -44860 | -74302 | 0.005189 | 0.647714 |
-23790 | 171578 | 244916 | 171578 | 0.427432 | 0.000000 |
30660 | -6044 | -6019 | -12038 | 0.004136 | 0.991727 |
-4856 | -152984 | -240992 | -152984 | 0.575276 | 0.000000 |
17029 | 99770 | 109556 | 109556 | 0.098086 | 0.098086 |
28151 | -67372 | -66810 | -93954 | 0.008342 | 0.394556 |
22910 | 43480 | 43344 | 72468 | 0.003128 | 0.666697 |
-24624 | -4830 | -4818 | -9636 | 0.002484 | 0.995031 |
-1389 | 104600 | 111078 | 111078 | 0.061931 | 0.061931 |
10802 | 76458 | 75822 | 76458 | 0.008318 | 0.000000 |
2959 | 374556 | 516480 | 374556 | 0.378913 | 0.000000 |
-31257 | -70772 | -96172 | -70772 | 0.358899 | 0.000000 |
21628 | -340624 | -514888 | -340624 | 0.511602 | 0.000000 |
12564 | 29512 | 29442 | 53826 | 0.002372 | 0.823868 |
3480 | -90698 | -106244 | -106244 | 0.171404 | 0.171404 |
The accuracy fluctuates significantly. Even when directly using jemalloc fast division to replace the division, there are still some results with excessively large accuracy errors.
Todo : Identify the cause of low accuracy.
Use Ripes to test the cycle count.
twin_tan()
directly using the division
twin_tan_my()
using jemalloc fast division with lookup table
twin_tan_fast()
using jemalloc fast division without lookup table
The version that uses a lookup table requires significantly fewer cycles compared to the version without the lookup table, but it still takes more cycles than directly performing division.