# 2019q3 Homework1 (review)
contributed by < kksweet8845 >

### Question 1
```cpp
#include <stdio.h>

int main()
{
    float sum = 0.0f;
    for (int i = 0; i < 10000; i++)
        sum += i + 1;
    printf("Sum %f\n", sum);
    return 0;
}
```

#### Thought
- Wrong thought
    - I thought this question was just a simple problem.
    - However, when jserv executed the code above, the answer was completely different from mine.
- Then I noticed that `sum` is a `float` while `i` is an `int`. According to the IEEE 754 standard, `sum`, being a `float`, may drop the trailing digits of a value once it becomes large enough.

#### Assumption
![](https://i.imgur.com/oGU3Axv.png)
(Source: IEEE 754, Wikipedia)

A single-precision `float` has a 23-bit fraction, so the largest significand is $1.\mathrm{FFFFFE}_{16} = 2 - 2^{-23}$, and the largest integer in the range where all consecutive integers are still represented exactly is
\begin{gather*}
(2 - 2^{-23}) \times 2^{23} = 2^{24} - 1
\end{gather*}
Here we just want to know how many decimal digits this threshold has, so we take the logarithm:
\begin{split}
k &= 2^{24} \\
\log_{10} k &= 24 \log_{10} 2 \approx 7.22 \\
k &\approx 1.68 \times 10^{7}
\end{split}
It means that once a value grows to about 8 decimal digits, the computer starts to drop the trailing digits of the value, so the error occurs. I assumed the sum would begin losing the small addends as soon as it passes this threshold, even though each added term $i + 1$ is itself exact.

#### Experiment
In order to verify my assumption, I wrote a program that sums from 1 to n and records the number of digits of the sum, to see whether the error appears once the sum reaches 8 digits. Fortunately, it coincides with my assumption.
```bash
==============================
From 1 to 5000
The value of sum_f is : 12502500.000000
The value of sum_i is : 12502500
The digits of sum_i is : 8
The error : 0.000000
Normalized Error value: 0.000000
==============================
From 1 to 6000
The value of sum_f is : 18002896.000000
The value of sum_i is : 18003000
The digits of sum_i is : 8
The error : 104.000000
Normalized Error value: 0.000006
==============================
From 1 to 7000
The value of sum_f is : 24502896.000000
The value of sum_i is : 24503500
The digits of sum_i is : 8
The error : 604.000000
Normalized Error value: 0.000025
==============================
From 1 to 8000
The value of sum_f is : 32002896.000000
The value of sum_i is : 32004000
The digits of sum_i is : 8
The error : 1104.000000
Normalized Error value: 0.000034
==============================
From 1 to 9000
The value of sum_f is : 40502896.000000
The value of sum_i is : 40504500
The digits of sum_i is : 8
The error : 1604.000000
Normalized Error value: 0.000040
==============================
From 1 to 10000
The value of sum_f is : 50002896.000000
The value of sum_i is : 50005000
The digits of sum_i is : 8
The error : 2104.000000
Normalized Error value: 0.000042
```

#### Improve accuracy
Now we know that the computer runs into this issue when accumulating large floating-point values. To improve accuracy, jserv offered a simple solution:
```cpp
#include <stdio.h>

int main()
{
    float sum = 0.0f, corr = 0.0f; /* corrective value for rounding error */
    for (int i = 0; i < 10000; i++) {
        float y = (i + 1) - corr; /* add the correction to the next item */
        float t = sum + y;        /* bits might be lost */
        corr = (t - sum) - y;     /* recover lost bits */
        sum = t;
    }
    printf("Sum: %f\n", sum);
    return 0;
}
```
The main idea here is to compute, via `(t - sum) - y`, the low-order bits that were just lost, and to keep that small correction out of the large accumulator so it is not lost again.
That is, to handle rounding error we often need to **rewrite our expression** so that no part of it loses significance when computing with very large numbers.

#### Quadratic formula
Take a simple example, the familiar quadratic formula:
\begin{split} \dfrac{-b - \sqrt{b^2-4ac}}{2a}\end{split}
We know that this form is inaccurate for negative $b$ and for large positive $b$ when translated naively to a floating-point computation. The expression is prone to two kinds of rounding error:
- for negative $b$, cancellation between $-b$ and $\sqrt{b^2-4ac}$
- for large positive $b$, **overflow** in computing $b^2$

For negative $b$, the error is caused by cancellation in the outer subtraction of the numerator, $-b - \sqrt{b^2-4ac}$. When $b^2$ is much larger than $a$ and $c$, the discriminant $\sqrt{b^2-4ac}$ approximately equals $\sqrt{b^2}$. But for negative $b$, $\sqrt{b^2} = -b$, so
\begin{split} (-b)-\sqrt{b^2-4ac} \approx (-b) - (-b) \end{split}
subtracting two large values to compute a small result, which causes terrible cancellation. Thus, we can rewrite the problematic subtraction by applying the following rule:
\begin{split}x-y \rightarrow \dfrac{x^2-y^2}{x+y}\end{split}
which produces
\begin{split} \dfrac{(-b)^2 - (\sqrt{b^2-4ac})^2}{(-b)+\sqrt{b^2-4ac}}\bigg/2a=\dfrac{4ac}{(-b)+\sqrt{b^2-4ac}}\bigg/2a \end{split}
However, this rewritten expression is still not accurate when $b$ is large and positive.
Then we use another approximation rule: when $|x|$ is small, $\sqrt{1+x} \approx 1+\dfrac{1}{2}x$. For large positive $b$, the term $\dfrac{4ac}{b^2}$ is tiny, so we can rewrite the expression as follows:
\begin{split} \dfrac{-b\sqrt{1-\dfrac{4ac}{b^2}} - b}{2a}\approx \dfrac{-b(1-\dfrac{2ac}{b^2})-b}{2a} = -\dfrac{b}{a} + \dfrac{c}{b} \end{split}
Finally, we can conclude that the quadratic formula can be written as follows:
$quadratic(a,b,c)\ = \left\{ \begin{array}{rcl} & \dfrac{4ac}{-b+\sqrt{b^2-4ac}}\bigg/2a & & {\mathrm{if}\ b<0}\\ &(-b-\sqrt{b^2-4ac})\dfrac{1}{2a} & & {\mathrm{if}\ 0 \leq b \leq 10^{127}}\\ &-\dfrac{b}{a}+\dfrac{c}{b} & & {\mathrm{if}\ b > 10^{127} }\\ \end{array} \right.$

I tested it myself. It actually improves the accuracy when $b$ is a large negative number.
```bash
b = -574290360989628072255692613763742040064.000000
The original formula with no tricks
0.000000000000000000000000000000000000000000000000000000000000
The modified one
0.000000000000000000000000000000000008973251691435005568199720
```
Unfortunately, I could not observe an accuracy improvement for the large positive case; I obtained identical results. However, the modified expression saves a lot of computation.
```bash
574290360989628072255692613763742040064.000000
The original formula with no tricks
-218777280377001173981114506477345701888.00000
The modified one
-218777280377001173981114506477345701888.00000
```

#### Pathological floating-point arithmetic
For computers, even simple arithmetic can cause rounding errors, such as
```
0.1 + 0.2
>> 0.30000000000000004
```
Because the base is 2, the floating-point numbers 0.1 and 0.2 cannot be represented exactly in binary form; they are approximated by the nearest representable pattern, which is why the 4 appears at the end of the result.

#### IEEE standard
- IEEE 854 allows $\beta = 10$, partly because base 10 is how humans exchange and think about numbers.
- IEEE 854 requires that if the base is not 10, it must be 2.
- One of the reasons is that the results of error analyses are much tighter when $\beta$ is 2, because a rounding error of $1/2$ ulp wobbles by a factor of $\beta$ when converted to a relative error, and error analyses are almost always simpler when based on relative error.

#### Example
Consider $\beta = 16$, $p = 1$ compared to $\beta = 2$, $p = 4$. Both systems have 4 bits of significand. Consider the computation of $\dfrac{15}{8}$.
- For $\beta = 2$, $p = 4$, it can be represented exactly as
    - $1.111 \times 2^0$
- For $\beta = 16$, $p = 1$, it can only be represented as
    - $1 \times 16^0$

In general, base 16 can lose up to 3 bits, so a precision of $p$ hexadecimal digits can have an effective precision as low as $4p-3$ bits rather than $4p$. Nonetheless, IBM chose $\beta = 16$, $p = 6$ for its [IBM System/370](https://en.wikipedia.org/wiki/IBM_System/370), for two reasons:
- ==Increased exponent range== : The significand requires 24 bits. Since this must fit into 32 bits, 7 bits are left for the exponent and 1 for the sign bit. Thus the magnitude of representable numbers ranges from about $16^{-2^6}$ to about $16^{2^6} = 2^{2^8}$. To get a similar exponent range with $\beta = 2$ would require 9 bits of exponent, leaving only 22 bits for the significand.
- ==Shifting== : When adding two floating-point numbers with different exponents, one of the significands must be shifted to line up the radix points, slowing down the operation. In the $\beta = 16$, $p = 1$ system, all the numbers between 1 and 15 have the same exponent, so no shifting is required when adding any of the $\binom{15}{2} = 105$ possible pairs of distinct numbers from this set. In the $\beta = 2$, $p = 4$ system, however, these numbers have exponents ranging from 0 to 3, and shifting is required for 70 of the 105 pairs.

#### Special Quantities
- On some floating-point hardware every bit pattern represents a valid floating-point number. The IBM System/370 is an example of this.
On the other hand, the VAX reserves some bit patterns to represent special numbers called **reserved operands**.
- The IEEE standard continues in this tradition and has NaNs (Not a Number) and infinities. Without special quantities, there is no good way to handle exceptional situations like taking the square root of a negative number other than aborting the computation. So the IEEE standard specifies the following special values: $\pm 0$, denormalized numbers, $\pm \infty$, and NaNs.

|Exponent|Fraction|Represents|
|:----:|:----:|:----:|
|$e = e_{min} -1$|$f=0$|$\pm0$|
|$e = e_{min} -1$|$f\ne0$|$0.f\times2^{e_{min}}$|
|$e_{min} \leq e \leq e_{max}$|all|$1.f \times 2^e$|
|$e = e_{max} +1$|$f=0$|$\pm\infty$|
|$e = e_{max} +1$|$f \ne 0$|NaN|

#### Exceptions, Flags and Trap Handlers
Exceptions
- When an exceptional condition like division by zero or overflow occurs in IEEE arithmetic, the default is to deliver a result and continue. The typical default results are NaN for $0/0$ and $\sqrt{-1}$, and $\infty$ for $1/0$ and overflow.

Flags
- The flags are "sticky": once set, they remain set until explicitly cleared. For example, testing the flags is the only way to distinguish $1/0$, which is a genuine infinity, from an overflow.

Trap Handlers
- Sometimes continuing to execute in the face of exceptional conditions is not appropriate. Here is an example:
    - For $x/(x^2+1)$, when $x > \sqrt{\beta}\ \beta^{e_{max}/2}$, the denominator overflows to infinity, yielding a final answer of 0, which is totally wrong.
    - Although rewriting the expression as $1/(x+x^{-1})$ solves this particular problem, rewriting cannot always solve the problem.
- The IEEE standard strongly recommends that implementations allow trap handlers to be installed.
|Exception|Result When Traps Disabled|Argument to Trap Handler|
|:----:|:----:|:----:|
|Overflow|$\pm\infty$ or $\pm x_{max}$|round($x2^{-\alpha}$)|
|Underflow|$0$, $\pm 2^{e_{min}}$ or denormal|round($x2^{\alpha}$)|
|Divide by zero|$\pm\infty$|operands|
|Invalid|NaN|operands|
|Inexact|round($x$)|round($x$)|

- One obvious use for trap handlers is backward compatibility.
- This is especially useful for code with a loop like the one below, which keeps looping while the condition is false. Since comparing a NaN to a number with relational operators always returns false, this code will go into an infinite loop if x ever becomes a NaN.
```cpp
do {
    /* statement */
} while (!(x >= 100));
```
- Another, more interesting use for trap handlers comes up when computing products such as $\prod_{i=1}^{n} x_i$ that could potentially overflow. One solution is to use logarithms and compute $\exp(\sum \log x_i)$ instead.

#### System aspects
The design of almost every aspect of a computer system requires knowledge about floating point. As an example of how plausible design decisions can lead to unexpected behavior, consider the following program:
```cpp
float q = 3.0 / 7.0;
if (q == 3.0 / 7.0)
    printf("Equal");
else
    printf("Not Equal");
```
- When compiled and run on my laptop, the program prints ==Not Equal==! The literal `3.0 / 7.0` is evaluated in double precision, while `q` holds only a single-precision approximation of it.
- Incidentally, some people think the solution to such anomalies is never to compare floating-point numbers for equality, but instead to consider them equal if they are within some error bound E.
- This is hardly a cure-all, because it raises as many questions as it answers. What should the value of E be? If x < 0 and y > 0 are within E, should they be considered equal, even though they have different signs? Furthermore, the relation defined by this rule, $a \sim b \Leftrightarrow |a-b| < E$, is not an equivalence relation, because $a \sim b$ and $b \sim c$ do not imply $a \sim c$.

#### Instruction Sets
- It is common for an algorithm to require a short burst of higher precision in order to produce accurate results.
- Take the previous example of the quadratic formula. Computing $b^2-4ac$ in double precision, when each of the quantities a, b, and c is in single precision, is easy if there is a multiplication instruction that takes two single-precision numbers and produces a double-precision result.
- Hardware to compute a double-precision product from single-precision operands will normally be only a little more expensive than a single-precision multiplier, and much less expensive than a double-precision multiplier.
- This instruction has many other uses besides the quadratic formula.
- Consider the problem of solving a system of linear equations:
\begin{split} a_{11}x_1 + a_{12}x_2+ \dots + a_{1n}x_n &= b_1 \\ a_{21}x_1 + a_{22}x_2+ \dots + a_{2n}x_n &= b_2 \\ \dots \\ a_{n1}x_1 + a_{n2}x_2+ \dots + a_{nn}x_n &= b_n \end{split}
which can be written in matrix form as $Ax = b$, where
\begin{split}A=\begin{bmatrix} a_{11} & a_{12} & a_{13} & \dots & a_{1n} \\ a_{21} & a_{22} & a_{23} & \dots & a_{2n} \\ \dots & \dots & \dots & \dots & \dots \\ a_{n1} & a_{n2} & a_{n3} & \dots & a_{nn} \end{bmatrix}\end{split}
Suppose a solution $x^{(1)}$ is computed by some method, perhaps Gaussian elimination. There is a simple way to improve the accuracy of the result, called ==iterative improvement==. First compute
\begin{split} \xi = Ax^{(1)} - b \end{split}
Then solve the system
\begin{split} Ay = \xi \end{split}
Since
\begin{split} Ay \approx \xi \approx Ax^{(1)}-b = A(x^{(1)}-x) \end{split}
we get $y \approx x^{(1)} - x$, so $x^{(2)} = x^{(1)} - y$ is an improved estimate of the solution. We can then repeat the process, replacing $x^{(1)}$ with $x^{(2)}$ and $x^{(2)}$ with $x^{(3)}$. The argument that $x^{(i+1)}$ is more accurate than $x^{(i)}$ is only informal.
- When performing iterative improvement, $\xi$ is a vector whose elements are differences of nearby inexact floating-point numbers, and so can suffer from catastrophic cancellation. Thus, iterative improvement is not very useful unless $\xi = Ax^{(1)} - b$ is computed in double precision.
Once again, this is a case of computing the product of two single-precision numbers where the full double-precision result is needed.

#### Extended source
- [Automatically Improving Accuracy for Floating Point Expressions](https://herbie.uwplse.org/pldi15-paper.pdf)
- [What Every Computer Scientist Should Know About Floating-Point Arithmetic](https://www.itu.dk/~sestoft/bachelor/IEEE754_article.pdf)

:::info
TODO: check [Pairwise summation](https://en.wikipedia.org/wiki/Pairwise_summation) as well.
> In numerical analysis, pairwise summation, also called cascade summation, is a technique to sum a sequence of finite-precision floating-point numbers that substantially reduces the accumulated round-off error compared to naively accumulating the sum in sequence. Although there are other techniques such as Kahan summation that typically have even smaller round-off errors, pairwise summation is nearly as good (differing only by a logarithmic factor) while having much lower computational cost—it can be implemented so as to have nearly the same cost (and exactly the same number of arithmetic operations) as naive summation.

:notes: jserv
:::

### Question 2
```cpp
#include <stdio.h>
#define align4(x) (((x) + K) & (-4))

int main(void)
{
    int p = 0x1997;
    printf("align4(p) is %08x\n", align4(p));
    return 0;
}
```

#### Thought
- At first, I gave the wrong answer to this question: I only looked for a number that would produce the desired output, 0x1998, given 0x1997.
- Afterwards, however, I thought the question through carefully.
- According to jserv's description of this macro, it rounds an address up to an aligned address.
- That is, any address that is not already a multiple of 4 must be rounded up to the next multiple of 4.
- Expressed mathematically, the result is:
\begin{split} \lceil p \rceil_{(4)} \end{split}
This made me consider the rule of remainders.
\begin{split} \forall m \in \mathbb N,\ m \in \{ 4n+p \mid n \in \mathbb N,\ 0 \leq p \leq 3 \} \end{split}
Every address can be expressed in one of four forms, shown below. The $4n$ case needs no change, while $4n+1$, $4n+2$, and $4n+3$ must be rounded up to $4(n+1)$:
\begin{split} &4n &\Rightarrow 4n \\ &4n+1 &\Rightarrow 4(n+1) \\ &4n+2 &\Rightarrow 4(n+1) \\ &4n+3 &\Rightarrow 4(n+1) \end{split}
Applying simple math, we add 3 to every line, and get:
\begin{split} &4n\ +&(3)&\Rightarrow 4n + 3\\ &4n+1+&(3) &\Rightarrow 4(n+1) + 0\\ &4n+2+&(3) &\Rightarrow 4(n+1) + 1 \\ &4n+3+&(3) &\Rightarrow 4(n+1) + 2 \end{split}
Finally, we just need to discard the remainder. Here jserv uses a clever technique: `& -4`. Because -4 in two's complement binary form is 0xFFFFFFFC (32 bits), `x & -4` clears the two lowest bits of x, forcing the value down to a multiple of 4. That is,
```cpp
0xFFFFFFF6 & (-4) => 0xFFFFFFF4
0xFFFFFFF8 & (-4) => 0xFFFFFFF8
0xFFFFFFF7 & (-4) => 0xFFFFFFF4
```
We can see that the low bits become a multiple of 4. To round **up** to the nearest multiple of 4 instead, just add 3 first:
```cpp
(0xFFFFFFF6 + 3) & (-4) => 0xFFFFFFF8
(0xFFFFFFF8 + 3) & (-4) => 0xFFFFFFF8
(0xFFFFFFF7 + 3) & (-4) => 0xFFFFFFF8
```

#### The Linux kernel also has an ALIGN macro
In [arch/riscv/include/asm/page.h](https://github.com/torvalds/linux/blob/master/arch/riscv/include/asm/page.h)

:::danger
Add hyperlink to Linux kernel source tree (via GitHub or LXR). e.g. [include/linux/list.h](https://github.com/torvalds/linux/blob/master/include/linux/list.h)
:notes: jserv
:::

```cpp
/* @a is a power of 2 value */
#define ALIGN(x, a)             __ALIGN_KERNEL((x), (a))
#define ALIGN_DOWN(x, a)        __ALIGN_KERNEL((x) - ((a) - 1), (a))
#define __ALIGN_MASK(x, mask)   __ALIGN_KERNEL_MASK((x), (mask))
#define PTR_ALIGN(p, a)         ((typeof(p))ALIGN((unsigned long)(p), (a)))
#define IS_ALIGNED(x, a)        (((x) & ((typeof(x))(a) - 1)) == 0)
```

In [tools/testing/scatterlist/linux/mm.h](https://github.com/torvalds/linux/blob/master/tools/testing/scatterlist/linux/mm.h)
```cpp
#define __ALIGN_KERNEL(x, a)          __ALIGN_KERNEL_MASK(x, (typeof(x))(a) - 1)
#define __ALIGN_KERNEL_MASK(x, mask)  (((x) + (mask)) & ~(mask))
```

#### __ALIGN_KERNEL_MASK(x, mask)
```cpp
#define __ALIGN_KERNEL_MASK(x, mask)  (((x) + (mask)) & ~(mask))
```
This is the general round-up formula as written by the Linux developers; `mask + 1` is restricted to powers of 2.

#### __ALIGN_KERNEL(x, a)
```cpp
#define __ALIGN_KERNEL(x, a)          __ALIGN_KERNEL_MASK(x, (typeof(x))(a) - 1)
```
A more convenient wrapper that builds the mask from the alignment `a`, cast to the type of `x`.

#### ALIGN(x, a)
```cpp
#define ALIGN(x, a)                   __ALIGN_KERNEL((x), (a))
```
Simply a shorter name for the same macro.

#### Extended source
- [align macro kernel stack overflow](https://stackoverflow.com/questions/13122846/align-macro-kernel)
- [Linux kernel-5.3.0](https://www.kernel.org/)

### Question 3
```cpp
#include <stdbool.h>

bool func(unsigned int x)
{
    return x && ((x & (~x + 1)) == x);
}
```
The code above can be rewritten as the following equivalent code:
```cpp
bool func(unsigned int x)
{
    unsigned int unknown = 0;
    while (x && unknown <= 1) {
        if ((x & Z) == 1)
            unknown++;
        x >>= 1;
    }
    return (unknown == 1);
}
```
- The answer for Z is 1: the loop shifts x right one bit at a time and counts how many bits are set, and any x with two or more set bits makes the function return false.
- So `func` checks whether x has exactly one set bit, i.e. whether x is a power of 2.

#### Thought
- In the beginning, I could not figure out what happens when x is ANDed with its two's complement, ~x + 1.
I tested it by hand with all 4-bit values, and found that:

|x|x & (~x + 1)|x & (~x + 1) == x|
|:-----:|:----:|:---:|
|0000|0000 (with overflow)|0|
|0001|==0001==|1|
|0010|==0010==|1|
|0011|0001|0|
|0100|==0100==|1|
|0101|0001|0|
|0110|0010|0|
|0111|0001|0|
|1000|==1000==|1|
|1001|0001|0|
|1010|0010|0|
|1011|0001|0|
|1100|0100|0|
|1101|0001|0|
|1110|0010|0|
|1111|0001|0|

Obviously, the numbers that stay identical after the operation are exactly those with a single set bit: `x & (~x + 1)` extracts the lowest set bit of x. (The leading `x &&` in `func` still makes 0000 return false.)

#### Compute the minimum (min) or maximum (max) of two integers without branching
Here is an approach to performing the comparison:
```cpp
int x;  // we want to find the minimum of x and y
int y;
int r;  // the result goes here
r = y ^ ((x ^ y) & -(x < y));  // min(x, y)
```

##### Description
Consider the `-(x < y)` term shown above:
```cpp
int x = 5, y = 9;
printf("-(x<y) : %x\n", -(x < y));  // 0xffffffff
x = 10, y = 9;
printf("-(x<y) : %x\n", -(x < y));  // 0
```
- According to the result above:
    - If x < y is true, then -(x < y) is all ones, so (x ^ y) & ~0 is (x ^ y). That is, r = y ^ (x ^ y) = x.
    - If x < y is false, then -(x < y) is all zeros, so (x ^ y) & 0 is 0. That is, r = y ^ 0 = y.

##### Maximum version
```cpp
int x;  // we want to find the maximum of x and y
int y;
int r;  // the result goes here
r = x ^ ((x ^ y) & -(x < y));  // max(x, y)
```
The only change from the minimum version is that the leading y becomes x.

##### Application
- On some rare machines where branching is very expensive and no conditional move instructions exist, the expression above might be faster than the obvious approach, r = (x < y) ? x : y, even though it involves two more instructions. However, on some machines, evaluating (x < y) as 0 or 1 still requires a branch instruction, so there may be no advantage.

Quick and dirty versions: if we know that INT_MIN <= x - y <= INT_MAX, we can use the following, which are faster because (x - y) only needs to be evaluated once.
```cpp
r = y + ((x - y) & ((x - y) >> (sizeof(int) * CHAR_BIT - 1)));  // min(x, y)
r = x - ((x - y) & ((x - y) >> (sizeof(int) * CHAR_BIT - 1)));  // max(x, y)
```
However, this version risks overflow when x - y falls outside that interval, so the precondition INT_MIN <= x - y <= INT_MAX is required.

#### Extended source
- [How to get absolute value as fast as possible](https://community.arm.com/developer/ip-products/processors/f/cortex-m-forum/6636/how-to-get-absolute-value-of-a-32-bit-signed-integer-as-fast-as-possible)
- [Bit Twiddling Hacks](https://graphics.stanford.edu/~seander/bithacks.html#ParityNaive)
- [How to swap two numbers without a third variable](https://www.geeksforgeeks.org/swap-two-numbers-without-using-temporary-variable/)
- [McBits: fast constant-time code-based cryptography](https://cryptojedi.org/papers/mcbits-20130616.pdf)
- [Side Channel Attack](https://www.sciencedirect.com/topics/computer-science/side-channel-attack)
- [Dude, is my code constant time?](https://eprint.iacr.org/2016/1123.pdf)
- [Algorithmic problem solving: How to efficiently compute the parity of a stream of numbers](https://www.freecodecamp.org/news/algorithmic-problem-solving-efficiently-computing-the-parity-of-a-stream-of-numbers-cd652af14643/)