[Toc]
# arithmetic for computers
some basic concept of arithemetic logic unit (ALU), starting from 1-unit ALU,


then we could combine them to build an 32-bits ALU.

---
Well, we could also add more functions and finally draw a simple figure to present an ALU.



with sample VHDL behavioral defination of a RISC-V ALU
```verilog
module RISCVALU(ALUctl, A, B, ALUOut, Zero);
input [3:0] ALUctl;
input [31:0] ALUOut;
output reg [31:0] ALUOut;
output Zero;
assign Zero = (ALUOut==0); // Zero is true if ALUOut is 0
always @(ALUOut, A, B) begin
case(ALUctl)
4'b0000: ALUOut <= A & B;
4'b0001: ALUOut <= A | B;
4'b0010: ALUOut <= A + B;
4'b0110: ALUOut <= A - B;
4'b0111: ALUOut <= A < B ? 1 : 0;
4'b1100: ALUOut <= ~(A | B); // NOR
default: ALUOut <= 0;
endcase
end
endmodule
```
## addition & subtraction
addition and subtraction
addition: ```a+b = (a^b) | ((a&b) << 1)```
substraction: ```a-b = a+(-b) = a+((~b)+1)```
in RISC-V, developers didn't include overflow ckeck on integer arithmetic operations in the base instruction set; howerer, many overflow checks can be cheaply implemented using RISC-V branches, such as:
```
add t0, t1, t2;
bltu t0, t1, overflow;
```
which is an unsigned addition overflow check.
| operation | operand A | operand B | result indicating overflow |
| --------- | --------- | --------- | -------------------------- |
| $A+B$ | $\ge 0$ | $\ge 0$ | $< 0$ |
| $A+B$ | $< 0$ | $< 0$ | $\ge 0$ |
| $A-B$ | $\ge 0$ | $< 0$ | $< 0$ |
| $A-B$ | $< 0$ | $\ge 0$ | $\ge 0$ |
## miltiplication
naive way to do multiplication:

multiplication hardware:


multiplication algorithm:

disadvantage:
1. waste the space of multiplicand.
2. since the multiplicand is 64-bit, 64-bit ALU is required. however, the 64-bit ALU only produce 32-bit data.
---
optimized multiplier: perform steps in parallel - add/shift
==EXAMPLE==
Using 4-bit numbers to save space, multiply $2_{10} \times 3_{10}$.
signed multiplication

faster multiplication by divided addition

## division
naive way to do divison:

divison hardware:

divison algorithm:

*dividend and divisor are both 32-bit.
==EXAMPLE==
using a 4-bit version of the algorithm to save pages, let’s try $7_{10} \div 2_{10}$


## floating point
defined by IEEE standard 754, the notation in __normalized binary__ (which computer represents in base 2) is $\pm 1.xxxxx_{2} \times 2^{yyyy}$, with two representations: single precision (32-bits) and double precision (64-bits).
IEEE Floating-Point Format:
==$x = (-1)^s \times (1+F) \times 2^{E-Bias}$==
* $s$: sign bit, $0$ means non-negative and $1$ means negative,
* normalize signficand: $1.0 \le |signficand| \le 2.0$, there always has a leading pre-binary-point 1 bit, so no need to represent it explictly. $F$ is the fraction with the "$1.$" restored.
* $E$: excess representation: actual exponent + bias, ensures exponent is unsigned, for using $127$($-128$ reserved) as a bias for single precision and $1023$($-1024$ reserved) for double precision
single precision (32-bits)

double precision (64-bits)

single-precision range:
exponent `0000 0000` and `1111 1111` are reserved
the smallest value is $\pm 1.0 \times 2^{00000001_2 - 127_{10}} = \pm 1.0 \times 2^{-126}$, which fraction is $000...00$ with 23-bits.
the largest value is $\pm 1.11111111111111111111111_2 \times 2^{11111110_2 - 127_{10}} = \pm 1.11111111111111111111111_2 \times 2^{254-127}$
as well as, the relative precision of single and double:
single: $23 \times \log_{10}2 \approx 23 \times 0.3 \approx 6$
double: $52 \times \log_{10}2 \approx 52 \times 0.3 \approx 16$
__denormal numbers__: $x = (-1)^s \times (0+F) \times 2^{-Bias}$, which is smaller than normal numbers
__zero__: occurs when both _exp_ and _mantissa_ are zero
$x = (-1)^s \times 0$ (the sign is matter, which $+0.0$ and $-0.0$ are represents 0.0)
__infinity__: occurs when the bits of _exp_ are all one and _mantissa_ equals to zero
$x = (-1)^s \times \infty$ (the sign is matter)
__NaN__: occurs when the bits of _exp_ are all one and _mantissa_ doesn't equal to zero
$x = NaN$ (the sign isn't matter)

---
floating addition algorithm:

==EXAMPLE==
What's decimal number does the floating number `0x427D0000` represented by IEEE 754 standard?
$427D0000_{16} = 01000010011111000000000000000000_2$
the value is $(-1)^0 \times (1+F) \times 2^{E-Bias}$, which the F is $1111100...000_2$ and the E is $10000100_2 = 132_{10}$. Hence the decimal number is $1.11111_2 \times 2^{132-127} = 1.11111 \times 2^5$
$11111_2 = 31_{10}$
==EXAMPLE==
add $0.5_{10}$ and $-0.4375_{10}$
assume we keep 4 bits of precision:
$0.5_{10} = 0.1_2 = (1 \times 2^{-1})_2$
$-0.4375_{10} = (-7/16)_{10} = (-1.11 \times 2^{-2})_2$
align the number whose exp is bigger:
$-1.11 \times 2^{-2} = -0.111 \times 2^{-1}$
add together:
$(1 \times 2^{-1}) + (-0.111 \times 2^{-1}) = 0.001 \times 2^{-1}$
normalize:
$0.001 \times 2^{-1} = 1 \times 2^{-4} = 0.0625_{10}$
==EXAMPLE==
multiplying $0.5_{10}$ and $-0.4375_{10}$
$0.5_{10} = 0.1_2 = (1 \times 2^{-1})_2$
$-0.4375_{10} = (-7/16)_{10} = (-1.11 \times 2^{-2})_2$
$(-1)+(-2) = (-3) = 124 - 127$, so exp is $124$
the fraction is:
$1.000 \times 1.110 = (1000 \times 1110) \times 0.000001 = 1110000 \times 0.000001 = 1.110000$, round, $1.11$
the product is $(1.110 \times 2^{124-127})_2 = 1.75/8_{10} = 0.21875$
with the sign, the answer is $-0.21875_{10}$
---
fallacy: _right shift by n-bit is equal to divided by $2^n$ and left shift by n-bit is equal to multiple by $2^n$_
when the number to be operate is <font color=#f00>__unsigned__</font>, the conclusion is __TRUE__.
but when the number is signed, the conclusion is wrong.
consider -5 divided by 4:
in 8 bits binary, -5 is `0b11111011` when -5 divided by 4, we expect the outcome shall be -1. but `0b11111011 >> 2` = `0b00111110` which is 62 instead, absoutely wrong.
maybe try shift with sign extend? `0b11111011 >> 2` = `0b11111110` which is -2, close, but still wrong.
```cpp=
#include <iostream>
int main(){
std::cout << ((-5) >> 2) << std::endl; return 0;
}
```

pitfall: _addition in float is <font color=#f00>__associative__</font>_
floating-point numbers have limited precision and result in approximations of real results.
consider $a = 1.5 \times 10^{38}, b = 1.0, c = -1.5 \times 10^{38}$, whether the result of `(c+a)+b` and `c+(a+b)`?
```cpp=
#include <iostream>
using namespace std;
int main(){
float a = 1.5e38, b = 1.0, c = -a;
cout << (c+a) + b << endl;
cout << c + (a+b) << endl;
return 0;
}
```
