--- tags: Computer Architecture 2021 --- # Lab1: RV32I Simulator ###### tags: ``RISC-V``, ``Computer Architecture 2021`` ## Sqrt(x) > Problem description of [LeetCode 69](https://leetcode.com/problems/sqrtx/). Given a non-negative integer `x`, compute and return the square root of `x`. Since the return type is an integer, the decimal digits are **truncated**, and only the integer part of the result is returned. **Note:** You are not allowed to use any builti-in exponent function or operator, such as `pow(x, 0.5)` or `x ** 0.5`. ## Solution Basically this problem asks us to find $y$ of $y=\lfloor\sqrt{x}\rfloor$ with given variable $x$. where $x$ is a non-negative integer that ranges between $[0, 2^{31}-1]$. Since we already know the value $y$ has been square rooted. We can infer than the value of $y$ is not going to be bigger than $\lfloor\sqrt{2^{31}-1}\rfloor$ which is $46340$. So the searching domain of $y$ is $[0, 46340]$. Usually, this kind of one-dimension searching problem with its problem domain following a strong value order can be solved by **binary search**. Which we have to determine the upper bound and lower bound of the current searching range. And swallow half of the searching range by comparing where the middle point belongs to (upper range or lower range). **My solution** exploits the fact that we are looking for an integer, and the relationship between $x$ and $y$ follows a strict order. So we can just enumerate each bit to find the integer that is closest to $\sqrt{x}$. Doing so offers much cleaner code. ### Algorithm in Action ![](https://i.imgur.com/guEXCL8.png) ## Implementation ### C Soltution ```c #include <stdio.h> int mySqrt(int x){ unsigned int s = 0; for (unsigned int i = (1 << 15);i > 0;i >>= 1) if ((s+i) * (s+i) <= x) s += i; return s; } int main() { printf("%d\n", mySqrt(2147483647)); return 0; } ``` :::info The function prototype and its body: ```cpp int mySqrt(int x) { long lo = 0, hi = x + 1u; while (lo < hi) { long mi = (hi - lo) / 2 + lo; if (mi * mi <= x) lo = mi + 1; else hi = mi; } return lo - 1; } ``` :notes: jserv ::: ### RISC-V Assembly Solution ```c # https://leetcode.com/problems/sqrtx/ # Calculate floor(sqrt(x)) .data x: .word 2147483647 .text main: # call floor(squrt(x)) lw a0, x # load argument x jal ra, floor_sqrt # call floor_sqrt(x) # print result mv a0, a0 # integer to print li a7, 36 # print int environment call (1) ecall # print int environment call # exit li a7, 10 # exit environment call id (10) ecall # exit environment call floor_sqrt: li t0, 0 # unsigned int s = 0; li t1, 1 # unsigned int i = 1; slli t1, t1, 15 # i = i << 15 for_next_bit: beqz t1, next_bit_end # if i == 0 goto next_bit_end add t2, t0, t1 # t2 = s + i mul t3, t2, t2 # t3 = (t2)^2 = (s+i)^2 bltu a0, t3, if_x_is_less_than_t3 # if a0 < t3 then don't add add t0, t0, t1 # s = s + i if_x_is_less_than_t3: srli t1, t1, 1 # i = i >> 1 j for_next_bit # goto for_next_bit next_bit_end: mv a0, t0 ret ``` :::warning TODO: what if your processor lacks of `MUL` instruction? :notes: jserv ::: :::info Given two number $x$ and $y$, if we want to calculate the $x\times y$ value without `mul`. We can break apart one of the number ($y$) into a combination of $2^n$ bits. And perform the `mul` operation by `sll` and `add`. $$ 3_{10} \times 74_{10} = 3_{10} \times (2^6+2^3+2^1) = 11_2 \times (1000000_2 + 1000_2 + 10_2) \\ = (11_2 << 6) + (11_2 << 3) + (11_2 << 2) $$ ::: ### Compiled Assembly and Memory Following snippet contains the actual assembly code to execute. ``` 00000000 <main>: 0: 10000517 auipc x10 0x10000 4: 00052503 lw x10 0 x10 8: 018000ef jal x1 24 <floor_sqrt> c: 00050513 addi x10 x10 0 10: 02400893 addi x17 x0 36 14: 00000073 ecall 18: 00a00893 addi x17 x0 10 1c: 00000073 ecall 00000020 <floor_sqrt>: 20: 00000293 addi x5 x0 0 24: 00100313 addi x6 x0 1 28: 00f31313 slli x6 x6 15 0000002c <for_next_bit>: 2c: 00030e63 beq x6 x0 28 <next_bit_end> 30: 006283b3 add x7 x5 x6 34: 02738e33 mul x28 x7 x7 38: 01c56463 bltu x10 x28 8 <if_x_is_less_than_t3> 3c: 006282b3 add x5 x5 x6 00000040 <if_x_is_less_than_t3>: 40: 00135313 srli x6 x6 1 44: fe9ff06f jal x0 -24 <for_next_bit> 00000048 <next_bit_end>: 48: 00028513 addi x10 x5 0 4c: 00008067 jalr x0 x1 0 ``` And the initial memory snapshot. | ![memory snapshot of my program, location 0x10000000 contains a word with value 0x7fffffff](https://i.imgur.com/RTzjFQo.png) | | --- | | <center>(Figure 1). the inital memory snapshot of my program </center> | Judging from Figure 1, we can observed that there is only one word of datum in the program. Located in `0x10000000`. It represents the value of `x` in $\lfloor\sqrt{x}\rfloor$. In this run I set this value to the maximum of a integer, which is $2^{31}-1=2147483647_{10}=\text{0x7FFFFFFF}_{16}$. ## Assembly Code Logic The rest of this section contains two parts. Each part shows one aspect of the program and describes how this assembly code make sense to the problem we are solving. ### Execution 1: Main ``` 00000000 <main>: 0: 10000517 auipc x10 0x10000 4: 00052503 lw x10 0 x10 ``` In the `main` routine. The first thing we do is calling another routine `floor_sqrt`, which contains the main logic for solving this problem. To call this routine, first thing we need is place the function argument, which is the value of $x$. To do so we have to load the value of $x$ from the memory. Since the instruction size of `lw` is 32 bits. It is impossible to places the whole memory address into one instruction(doing so will make no place for opcode and other things to represent). To solve this, the memory loading instruction has been break apart into two instructions. 1. `auipc x10 0x10000`, append the given immediate value `0x10000` with 12 more zero bits, add this value with the value of `pc` register. Place the result into `x10` register. Doing so we now obtain the value `0x10000000` for register `x10`, the value `0x10000000` is actually the address of $x$ in memory, we can realize this by looking at Figure 1. 2. `lw x10 0 x10`, this instruction calculate the memory address by adding the content of register `x10` with an immediate offset value `0`. Once the calculation is done, load the memory word at that address into register `x10`. Once the process is done, we palce the value of $x$ into register `x10`. ``` 8: 018000ef jal x1 24 <floor_sqrt> # .... # .... 00000020 <floor_sqrt>: 20: 00000293 addi x5 x0 0 24: 00100313 addi x6 x0 1 ``` The next thing we do is perform the `jal`(jump and link) instruction. Which set the `pc` to the given instruction address(calculated by `pc` + immediate value, which is $8_{16}+24_{10}=32_{10}=20_{16}$), and preserve the return address to register `ra`. According to the [RISC-V Assembly Programmer's Handbook](https://riscv.org/wp-content/uploads/2017/05/riscv-spec-v2.2.pdf), the argument registers are `a0` and `a1`. In fact, the `a0` is just an alias to `x10`, which is the register we are dealing with all along. So now the value of $x$ stored in `x10` will become the argument of this routine call of `floor_sqrt`. ``` 8: 018000ef jal x1 24 <floor_sqrt> c: 00050513 addi x10 x10 0 10: 00100893 addi x17 x0 36 14: 00000073 ecall 18: 00a00893 addi x17 x0 10 1c: 00000073 ecall ``` Once the call routine is finished, we will continue the execution at `0xc`, and at this moment the register `a0`, which is the alias of `x10`. Will contains the return value from routine `floor_sqrt`. Which is the $y$ value in term of given $x$. After that we are going to output the result to somewhere. Here we are using the RISC-V environment call. > The ECALL instruction is used to make a service request to the execution environment. The EEI will define how parameters for the service request are passed, but usually these will be in defined locations in the integer register file. > [name=The RISC-V Instruction Set Manual Volume I: Unprivileged ISA Ch 2.8] According to the execution environment we are using - Ripes, It define the following ecalls. | `a7` | `a0` | *Name* | *Description* | | ---- | ------------------- | -------------- | ------------------------------------------------------------ | | 1 | (integer to print) | print_int | Prints the value located in `a0` as a signed integer | | 2 | (float to print) | print_float | Prints the value located in `a0` as a floating point number | | 4 | (pointer to string) | print_string | Prints the null-terminated string located at address in `a0` | | 10 | - | exit | Halts the simulator | | 11 | (char to print) | print_char | Prints the value located in `a0` as an ASCII character | | 34 | (integer to print) | print_hex | Prints the value located in `a0` as a hex number | | 35 | (integer to print) | print_bin | Prints the value located in `a0` as a binary number | | 36 | (integer to print) | print_unsigned | Prints the value located in `a0` as an unsigned integer | | 93 | (status code) | exit | Halts the simulator and exits with status code in `a0` | To print out an unsigned integer, we have to set `a7` to `36` and set `a0` to the integer value. After that execute the `ecall` instruction. ``` c: 00050513 addi x10 x10 0 10: 00100893 addi x17 x0 36 14: 00000073 ecall ``` After print out the value, we now stop the program by execute `ecall` with `a7` set to `10` ``` 18: 00a00893 addi x17 x0 10 1c: 00000073 ecall ``` |![The execution result. The console window showing one number 46340 and saying 'Program exied with code: 0'](https://i.imgur.com/gdLeHmV.png)| |---| | <center>(Figure 2) Showing the result with $x=2147483647$</center>| ### Execution 2: Floor Square ``` 00000020 <floor_sqrt>: 20: 00000293 addi x5 x0 0 24: 00100313 addi x6 x0 1 28: 00f31313 slli x6 x6 15 0000002c <for_next_bit>: 2c: 00030e63 beq x6 x0 28 <next_bit_end> 30: 006283b3 add x7 x5 x6 34: 02738e33 mul x28 x7 x7 38: 01c56463 bltu x10 x28 8 <if_x_is_less_than_t3> 3c: 006282b3 add x5 x5 x6 00000040 <if_x_is_less_than_t3>: 40: 00135313 srli x6 x6 1 44: fe9ff06f jal x0 -24 <for_next_bit> 00000048 <next_bit_end>: 48: 00028513 addi x10 x5 0 4c: 00008067 jalr x0 x1 0 ``` > Suddenly realize I don't have to explain my code. I will stop here. 👁️👄👁️ > [name=Zheng-Xian Li] ## Processor Pipeline |![The circuit diagram of Ripes' 5 Stage RISC-V Processor](https://i.imgur.com/Yyqb3QK.png)| |---| | <center>(Figure 3) The circuit diagram of Ripes' 5 Stage RISC-V Processor </center> The processor in Figure 3 follows the [Instruction Pipeline Design](https://en.wikipedia.org/wiki/Instruction_pipelining). This pipeline consists of 5 stages. 1. Instruction Fetch(IF) 2. Instruction Decode and Register Fetch(ID) 3. Execute(EX) 4. Memory Access(MEM) 5. Register Write Back(WB) ### Instruction Fetch (IF) | ![](https://i.imgur.com/aDOVlwJ.png) | |---| | <center>(Figure 4) A closer look of IF stage (The processor is about to execute the first instruction) </center> | The Instruction Fetch stage focus on fetching instruction. The most important part is the `PC` Counter `(0)`. It keeps track of the instruction we are going to excute. Basically in the Ripes' setup, we will always start execution from the `0x0`. So in this diagram we can see that the output `(1)` of `PC` counter is `0x00000000`. The Instruction Memory Block at `(2)` is where we store all the instructions, the diagram block consist two IO path. `addr` and `inst`. `addr` accept the output from the `PC` counter output `(1)`, which is the instruction we suppose to execute next. The output port `instr` at `(3)` will send out the corresponding instruction binary at address given by input port `addr`. In the Figure 4 we are about to execute the first instruction `auipc x10 0x10000` so the corresponding instruction binary is `0x10000517` at `(3)`. Above the instruction memory block `(2)`, there is adder `(4)`. This adder is calculating the next instruction we usually going to execute next. The calculation is very simple, just increase current `PC` counter `(1)` by $4_{10}$ `(6)`. The number $4$ represents the instruction size (All instruction in RISC-V is $4$ bytes long). The result will be Backpropagated to the multiplexer `(8)`. In Figure 4 we can observed that `PC` counter is `0x0`. So the next instruction we are going to execute is $0_{16}+4_{16}=4_{16}$, so the signal at `(7)` showing `0x00000004`. There is another input `(9)` for multiplexer `(8)`. It is used by other instructions like `jal`, `j`, `beg` and others. To set `PC` counter to a specific location when certain condition met. | ![](https://i.imgur.com/jBOyxJn.png) | | --- | | <center>(Figure 5) `jal` instruction set `PC` counter to specific location</center> | In Figure 5 we can see the `EX` stage is executing `jal` instruction, which make control flow move to specific location `0x00000020`. instead of the next instruction `0x00000010 + 4 = 0x00000014`. ### Instruction Decode and Register Fetch (ID) | ![](https://i.imgur.com/cLHR9Nb.png) | | --- | | <center>(Figure 6) ID stage, still executing the first instruction `auipc x10 0x10000`</center> | In this stage, the main task is to decode the compact instruction format into a much logical, much easier to use interface. This will make further stage design much easier. * `(0)` contains the instruction in binary format. Here corresponding representation of `the auipc x10 0x10000` is `0x10000517`. * `(1)` the PC counter value of this instruction. This value is propagated from IF stage. * `(2)` the address of next instruction. This value is propagated from IF stage. * `(9)` is the register 1 index in instruction * `(10)` is the register 2 index in instruction * `(11)` the opcode of this instruction. In this example it will be `AUIPC` * `(12)` is the destination register in instruction. In Figure 6. Since the destination register is `x10`, so the output of `(12)` become `0x0a`. * `(13)` is the immediate value `(4)` of this instruction. With the instruction `auipc x10 0x10000`, the immediate value will be the given value `0x10000` shift 12 bits to left. As a result the output of `(13)` become `0x10000000`. We can observed that how the immediate value is calculated is totally based on the opcode `(11)`. So it is part of the input of this block `(4)`. * `(5)` The fattest block in Figure 6 - `Registers`. It stored all the registers value. * `(7)` the resolved register value of `(9)` register id 1. * `(14)` the resolved register value of `(10)` register id 2. * `(6)` and `(8)` is the signal backpropagated from further stage `WB`. If the multiplexer selector(sorry it is blocked by the opcode label) at block `(5)` is set to `Wr`. These signal `(6)` and `(8)` will be used. ### Execute (EX) | ![](https://i.imgur.com/5Mteyec.png) | | --- | | <center>(Figure 7) The EX stage, which perform the actual calculation by ALU unit.</center> | This stage contains the ALU unit, which going to perform the acutal arthimetic operation. * The multiplexer related to `(1)`, `(2)` and `(3)` is determining the candidate value of `Op2`. `Op2` is the operand 2 of this arthimetic operation. * `(2)` The value of register 2. * `(1)` The backpropagted value `(17)` from `WB` stage. Used to avoid data hazard by forwarding the calculation result to pipelines behind it. * `(3)` The backpropagted value `(20)` from `MEM` stage. Used to avoid data hazard by forwarding the calculation result to pipelines behind it. * The multiplexer related to `(4)`, `(5)` and `(6)` is determining the candidate value of `Op1`. `Op1` is the operand 1 of this arthimetic operation. * `(5)`, The value of register 1. * `(4)`, ditto to `(1)` * `(6)`, ditto to `(3)` * The multiplexer related to `(7)` and `(8)` is determine the actual value for Operand 1 * `(7)`, the `PC` counter value of this instruction. * `(8)`, the value will be one of `(4)`, `(5)`, or `(6)`. * The multiplexer related to `(9)` and `(10)` is determine the actual value for Operand 2 * `(10)`, the immediate value. * `(9)`, the value will be one of `(1)`, `(2)`, or `(3)` * `(13)` and `(14)` is the input of ALU unit. * `(15)` is the calculation result from ALU unit. * `(16)` backpropagate the calculation result to the `IF` multiplexer. For jumping or branching purpose. * `(11)` and `(12)` are used to determine if the branch is taken. Not sure what this block doing here since it doesn't connect to anything. Probably just here providing visual aid. > Suppose there is something called `Control Unit`, which run many flow control operation behind manything. But I don't see it in this diagram. It probably has been removed intentionally since drawing these connections on the diagram will make this diagram chaotic. > [name=Zheng-Xian Li] ### Memory Access (MEM) | ![](https://i.imgur.com/UZcifgt.png) | | --- | | <center>(Figure 8) the MEM stage, handling all the operation related to memory</center> | This stage deal with memory. The only block in Figure 8 is **Data Memory**. Which contain the logic for load/store memory. In the given Figure 8, we are executing the `lw x10 0 x10` instruction. Which loading a word from memory address `0x10000000`. The result is `0x7fffffff`. * `(3)` the calculated ALU result from last stage. * `(1)` the memory address for load/store. * `(2)` the given data input. * `(4)` the raw value of a multiplexer from last `EX` stage. * `(5)` backpropagte the value to pipelines behind it. * `Wr/en` selector, determine load or store operation. ### Register Write Back (WB) | ![](https://i.imgur.com/srEmqeL.png) | ![](https://i.imgur.com/MlOsZ3A.png)| | --- | --- | | <center>(Figure 9) The `WB` stage.</center> | <center>(Figure 10) backpropagted value of `4` and `5`</center> | This stage used to store the calculated value back to specific register. * `(1)` the `PC` counter value of this instruction * `(2)` the value of ALU calcuation result from `EX` stage. * `(3)` the loaded memory value. * `(5)` the reigster to write back. * `(4)` the write back value for specific register. In the Figure 9, we can see we are executing the instruction `auipc x10 0x10000`. It suppose to store the value of `0x100000000` to register `x10`. On the Figure 9 we can see that 1. `(4)` is `0x10000000`. 2. `(5)` is `0x0a`. And see Figure 10, We can see those values been backpropagated to the Registers block. with `Wr` selector enabled. So the register `x10` will set to value `0x10000000`. ## Bonus: Less Loop Version > Special thanks to [chinghongfang](https://github.com/chinghongfang), he tell me some clue QAQ. > [name=Zheng-Xian Li] The old method we start enumerate bits from $\sqrt{2^{31}-1}$. But in fact we can just start from the most siginifant bit of given $\sqrt{x}$. But how do we know the MSB of $\sqrt{x}$ before calculate the actual number $\sqrt{x}$? We can observed this by the MSB of $x$. A number can be written as a series of $2^n$ representation. like $$ 5566_{10}= 1010110111110_2 = 2^{12} + 2^{10} + 2^{8} + 2^{7} + 2^5 + 2^4 + 2^3 + 2^2 + 2^1 $$ If you sqrt this thing, you can always be sure the value $\sqrt{5566}$ will start from $2^6$. Which $6$ is the index of $x$ MSB divied by $2$. So if we know MSB index of $x$, we have save some loops. And here is a useful binary/math trick where you can get the index of the MSB bit $x$. That method related to **Count Leading Zero**. I found some article describe the method from our ancestors [[1]](https://hackmd.io/@workfunction/Byyd3nua?type=view) [[2]](https://hackmd.io/@jserv/B1LHefQ6?type=view) [[3]](https://hackmd.io/@jserv/SkKZBXZT?type=view) [[4]](https://hackmd.io/@ktvexe/H1B7-hGp?type=view) [[5]](https://stackoverflow.com/questions/11376288/fast-computing-of-log2-for-64-bit-integers). But none of the article describe what these method do in detail. Eventually I run into the article describe the method [here](https://sites.google.com/site/sydfhd/articles-tutorials/de-bruijn-sequence-generator). Describe what happened in detail. So here is the code: ```c // from https://sites.google.com/site/sydfhd/articles-tutorials/de-bruijn-sequence-generator int bit_scan_reverse(uint32_t b) { static const uint32_t Magic = (uint32_t) 0x07C4ACDD; static const int BitTable[32] = { u, 9, 1, 10, 13, 21, 2, 29, 11, 14, 16, 18, 22, 25, 3, 30, 8, 12, 20, 28, 15, 17, 24, 7, 19, 27, 23, 6, 26, 5, 4, 31, }; assert(b); b |= b >> 1; b |= b >> 2; b |= b >> 4; b |= b >> 8; b |= b >> 16; return BitTable[(b * Magic) >> 27]; } ``` What this method does is: 1. Attempts to retrieve the MSB of that given number $x$. 2. Multiply the number by `0x07C4ACDD`. Shift it by 27 bits. 3. Lookup the table. Now you got the index of the MSB of $x$ (0-based index). Now extend my old code with the new implementation. ```c= #include <stdio.h> #define u 0 // By Syed Fahad // from https://sites.google.com/site/sydfhd/articles-tutorials/de-bruijn-sequence-generator int bit_scan_reverse(unsigned int b) { static const unsigned int Magic = (unsigned int) 0x07C4ACDD; static const int BitTable[32] = { u, 9, 1, 10, 13, 21, 2, 29, 11, 14, 16, 18, 22, 25, 3, 30, 8, 12, 20, 28, 15, 17, 24, 7, 19, 27, 23, 6, 26, 5, 4, 31, }; b |= b >> 1; b |= b >> 2; b |= b >> 4; b |= b >> 8; b |= b >> 16; return BitTable[(b * Magic) >> 27]; } int mySqrtFast(int x){ unsigned int s = 0; for(unsigned int i = (1 << (bit_scan_reverse(x)>>1));i > 0;i >>= 1) if((s+i) * (s+i) <= x) s += i; return s; } int main() { printf("%d\n", mySqrtFast(5566)); return 0; } ``` ``` # https://leetcode.com/problems/sqrtx/ # Calculate floor(sqrt(x)) .data x: .word 0x7fffffff magic: .word 0x07C4ACDD table: .word 0, 9, 1, 10, 13, 21, 2, 29, 11, 14, 16, 18, 22, 25, 3, 30, 8, 12, 20, 28, 15, 17, 24, 7, 19, 27, 23, 6, 26, 5, 4 ,31 .text main: # call floor(squrt(x)) lw a0, x # load argument x jal ra, floor_sqrt # call floor_sqrt(x) # print result mv a0, a0 # integer to print li a7, 36 # print int environment call (1) ecall # print int environment call # exit li a7, 10 # exit environment call id (10) ecall # exit environment call floor_sqrt: li t0, 0 # unsigned int s = 0; li t1, 1 # unsigned int i = 1; mv s0, a0 # s0 = a0, save before call routine mv s1, ra jal ra, bit_scan_reverse srli a0, a0, 1 # a0 = a0 / 2 sll t1, t1, a0 # i = i << (a0/2) mv ra, s1 mv a0, s0 for_next_bit: beqz t1, next_bit_end # if i == 0 goto next_bit_end add t2, t0, t1 # t2 = s + i mul t3, t2, t2 # t3 = (t2)^2 = (s+i)^2 bltu a0, t3, if_x_is_less_than_t3 # if a0 < t3 then don't add add t0, t0, t1 # s = s + i if_x_is_less_than_t3: srli t1, t1, 1 # i = i >> 1 j for_next_bit # goto for_next_bit next_bit_end: mv a0, t0 ret bit_scan_reverse: srli a1, a0, 1 # or a0, a0, a1 # a0 = a0 | (a0 >> 1) srli a1, a0, 2 # or a0, a0, a1 # a0 = a0 | (a0 >> 2) srli a1, a0, 4 # or a0, a0, a1 # a0 = a0 | (a0 >> 4) srli a1, a0, 8 # or a0, a0, a1 # a0 = a0 | (a0 >> 8) srli a1, a0, 16 # or a0, a0, a1 # a0 = a0 | (a0 >> 16) # multiply by magic lw a1, magic # a1 = &magic mul a0, a0, a1 # a0 = a0 * magic srli a0, a0, 27 # a0 = (a0 * magic) >> 27 # table lookup la a1, table # a1 = &table slli a0, a0, 2 # a0 = a0 * 4 add a0, a0, a1 # a0 = &table[a0*4] lw a0, 0(a0) # a0 = *table[a0*4] ret ``` ``` 00000000 <main>: 0: 10000517 auipc x10 0x10000 4: 00052503 lw x10 0 x10 8: 018000ef jal x1 24 <floor_sqrt> c: 00050513 addi x10 x10 0 10: 02400893 addi x17 x0 36 14: 00000073 ecall 18: 00a00893 addi x17 x0 10 1c: 00000073 ecall 00000020 <floor_sqrt>: 20: 00000293 addi x5 x0 0 24: 00100313 addi x6 x0 1 28: 00050413 addi x8 x10 0 2c: 00008493 addi x9 x1 0 30: 038000ef jal x1 56 <bit_scan_reverse> 34: 00155513 srli x10 x10 1 38: 00a31333 sll x6 x6 x10 3c: 00048093 addi x1 x9 0 40: 00040513 addi x10 x8 0 00000044 <for_next_bit>: 44: 00030e63 beq x6 x0 28 <next_bit_end> 48: 006283b3 add x7 x5 x6 4c: 02738e33 mul x28 x7 x7 50: 01c56463 bltu x10 x28 8 <if_x_is_less_than_t3> 54: 006282b3 add x5 x5 x6 00000058 <if_x_is_less_than_t3>: 58: 00135313 srli x6 x6 1 5c: fe9ff06f jal x0 -24 <for_next_bit> 00000060 <next_bit_end>: 60: 00028513 addi x10 x5 0 64: 00008067 jalr x0 x1 0 00000068 <bit_scan_reverse>: 68: 00155593 srli x11 x10 1 6c: 00b56533 or x10 x10 x11 70: 00255593 srli x11 x10 2 74: 00b56533 or x10 x10 x11 78: 00455593 srli x11 x10 4 7c: 00b56533 or x10 x10 x11 80: 00855593 srli x11 x10 8 84: 00b56533 or x10 x10 x11 88: 01055593 srli x11 x10 16 8c: 00b56533 or x10 x10 x11 90: 10000597 auipc x11 0x10000 94: f745a583 lw x11 -140 x11 98: 02b50533 mul x10 x10 x11 9c: 01b55513 srli x10 x10 27 a0: 10000597 auipc x11 0x10000 a4: f6858593 addi x11 x11 -152 a8: 00251513 slli x10 x10 2 ac: 00b50533 add x10 x10 x11 b0: 00052503 lw x10 0 x10 b4: 00008067 jalr x0 x1 0 ``` For now, given a number $x$, the loop will execute $1+\lceil log_{2}\sqrt{x} \rceil$ times. instead of $1+\lceil log_2(\sqrt{2^{31}-1}) \rceil$. This table compares both methods by executed cycles and loops | number | Original(Cycles) | New(Cycles) | Original(Loop) | New(Loop) | |--------|---|---|---|---| |0|188|69|17|2| |1|187|68|17|2| |123|185|96|17|5| |1024|187|118|17|7 |5566|185|126|17|7| |0x77777777|179|210|17|17| |0x7fffffff|182|213|17|17| ## Bonus 2: Exit Loop Earlier If the value of $\lfloor\sqrt{x}\rfloor$ has a lot of zero at the lower bit side, for example $1024_{10} = 10000000000_2$. We can leave this loop quicker by just adding one branch instruction. ```c= #include <stdio.h> #define u 0 // By Syed Fahad // from https://sites.google.com/site/sydfhd/articles-tutorials/de-bruijn-sequence-generator int bit_scan_reverse(unsigned int b) { static const unsigned int Magic = (unsigned int) 0x07C4ACDD; static const int BitTable[32] = { u, 9, 1, 10, 13, 21, 2, 29, 11, 14, 16, 18, 22, 25, 3, 30, 8, 12, 20, 28, 15, 17, 24, 7, 19, 27, 23, 6, 26, 5, 4, 31, }; b |= b >> 1; b |= b >> 2; b |= b >> 4; b |= b >> 8; b |= b >> 16; return BitTable[(b * Magic) >> 27]; } int mySqrtFast(int x){ unsigned int s = 0; for(unsigned int i = (1 << (bit_scan_reverse(x)>>1));i > 0;i >>= 1) if((s+i) * (s+i) <= x){ s += i; // if value match, exit this loop earlier. if((s+i)*(s+i) == x) break; } return s; } int main() { printf("%d\n", mySqrtFast(5566)); return 0; } ``` ``` # https://leetcode.com/problems/sqrtx/ # Calculate floor(sqrt(x)) .data x: .word 1024 magic: .word 0x07C4ACDD table: .word 0, 9, 1, 10, 13, 21, 2, 29, 11, 14, 16, 18, 22, 25, 3, 30, 8, 12, 20, 28, 15, 17, 24, 7, 19, 27, 23, 6, 26, 5, 4 ,31 .text main: # call floor(squrt(x)) lw a0, x # load argument x jal ra, floor_sqrt # call floor_sqrt(x) # print result mv a0, a0 # integer to print li a7, 36 # print int environment call (1) ecall # print int environment call # exit li a7, 10 # exit environment call id (10) ecall # exit environment call floor_sqrt: li t0, 0 # unsigned int s = 0; li t1, 1 # unsigned int i = 1; mv s0, a0 # s0 = a0, save before call routine mv s1, ra jal ra, bit_scan_reverse srli a0, a0, 1 # a0 = a0 / 2 sll t1, t1, a0 # i = i << (a0/2) mv ra, s1 mv a0, s0 for_next_bit: beqz t1, next_bit_end # if i == 0 goto next_bit_end add t2, t0, t1 # t2 = s + i mul t3, t2, t2 # t3 = (t2)^2 = (s+i)^2 bltu a0, t3, if_x_is_less_than_t3 # if a0 < t3 then don't add add t0, t0, t1 # s = s + i beq a0, t3, next_bit_end # if s^2 == x, then t0 is the answer and we exit this loop earlier if_x_is_less_than_t3: srli t1, t1, 1 # i = i >> 1 j for_next_bit # goto for_next_bit next_bit_end: mv a0, t0 ret bit_scan_reverse: srli a1, a0, 1 # or a0, a0, a1 # a0 = a0 | (a0 >> 1) srli a1, a0, 2 # or a0, a0, a1 # a0 = a0 | (a0 >> 2) srli a1, a0, 4 # or a0, a0, a1 # a0 = a0 | (a0 >> 4) srli a1, a0, 8 # or a0, a0, a1 # a0 = a0 | (a0 >> 8) srli a1, a0, 16 # or a0, a0, a1 # a0 = a0 | (a0 >> 16) # multiply by magic lw a1, magic # a1 = &magic mul a0, a0, a1 # a0 = a0 * magic srli a0, a0, 27 # a0 = (a0 * magic) >> 27 # table lookup la a1, table # a1 = &table slli a0, a0, 2 # a0 = a0 * 4 add a0, a0, a1 # a0 = &table[a0*4] lw a0, 0(a0) # a0 = *table[a0*4] ret ``` ``` 00000000 <main>: 0: 10000517 auipc x10 0x10000 4: 00052503 lw x10 0 x10 8: 018000ef jal x1 24 <floor_sqrt> c: 00050513 addi x10 x10 0 10: 02400893 addi x17 x0 36 14: 00000073 ecall 18: 00a00893 addi x17 x0 10 1c: 00000073 ecall 00000020 <floor_sqrt>: 20: 00000293 addi x5 x0 0 24: 00100313 addi x6 x0 1 28: 00050413 addi x8 x10 0 2c: 00008493 addi x9 x1 0 30: 03c000ef jal x1 60 <bit_scan_reverse> 34: 00155513 srli x10 x10 1 38: 00a31333 sll x6 x6 x10 3c: 00048093 addi x1 x9 0 40: 00040513 addi x10 x8 0 00000044 <for_next_bit>: 44: 02030063 beq x6 x0 32 <next_bit_end> 48: 006283b3 add x7 x5 x6 4c: 02738e33 mul x28 x7 x7 50: 01c56663 bltu x10 x28 12 <if_x_is_less_than_t3> 54: 006282b3 add x5 x5 x6 58: 01c50663 beq x10 x28 12 <next_bit_end> 0000005c <if_x_is_less_than_t3>: 5c: 00135313 srli x6 x6 1 60: fe5ff06f jal x0 -28 <for_next_bit> 00000064 <next_bit_end>: 64: 00028513 addi x10 x5 0 68: 00008067 jalr x0 x1 0 0000006c <bit_scan_reverse>: 6c: 00155593 srli x11 x10 1 70: 00b56533 or x10 x10 x11 74: 00255593 srli x11 x10 2 78: 00b56533 or x10 x10 x11 7c: 00455593 srli x11 x10 4 80: 00b56533 or x10 x10 x11 84: 00855593 srli x11 x10 8 88: 00b56533 or x10 x10 x11 8c: 01055593 srli x11 x10 16 90: 00b56533 or x10 x10 x11 94: 10000597 auipc x11 0x10000 98: f705a583 lw x11 -144 x11 9c: 02b50533 mul x10 x10 x11 a0: 01b55513 srli x10 x10 27 a4: 10000597 auipc x11 0x10000 a8: f6458593 addi x11 x11 -156 ac: 00251513 slli x10 x10 2 b0: 00b50533 add x10 x10 x11 b4: 00052503 lw x10 0 x10 b8: 00008067 jalr x0 x1 0 ``` ### Benchmark | number | Orig(Cyc) | New(Cyc) | New2(Cyc) | Orig(Loop) | New(Loop) | New2(Loop) | |--------|---|---|---|---|---|---| |0|188|69|69|17|2|2 |1|187|68|64|17|2|1 |123|185|96|99|17|5|5 |1024|187|118|**64**|17|7|**1** |5566|185|126|129|17|8|8 |746819584|182|203|**144**|17|16|**9** |0x77754400|180|211|**164**|17|17|**11** |0x7fffffff|182|213|219|17|17|17