# Assignment 2-2: Single number ###### tags: `computer architure 2020` Problem description: This problem is based on [LeetCode 137](https://leetcode.com/problems/single-number-ii/). Given a non-empty array of integers, every element appears three times except for one, which appears exactly once. Find that single one. (Credit to the original author 鄭育丞 who wrote the assembly in [assignment 1](https://hackmd.io/@eecheng/HJojUvdrw)) ## C code implementation The C code implementation is modified to fit in the emulator as follow: ```c=1 #define SIZE 10 int singleNumber(int arr[]) { int lower = 0, higher = 0, mask = 0; for (int i = 0; i < SIZE; i++) { higher ^= lower & arr[i]; lower ^= arr[i]; mask = ~(lower & higher); higher &= mask; lower &= mask; } return lower; } void _start() { int arr[SIZE] = {5, 1, 1, 1, 9, 5, 5, 9, 4, 9}; int res = singleNumber(arr); volatile char *tx = (volatile char *) 0x40002000; if (res != -1) { *tx = (char) ((res + '0') & 0x000000FF); } } ``` ## Disassemble the ELF In this section the C code of binary search is compiled into RV32 binary with different compiler optimization level (`-O3` to `-O0`). The compiled assembly code is obtained by the command `riscv-none-embed-objdump -d binSearch`. The assembly code is lengthy and hard to read. Some unimportant section of code is omitted to better demonstrate the effect of compiler optimization. ### -O0 * Observation * LOC (Line Of Code): `51` * Allocate `48` bytes on stack * `-36(s0)` stores the base address of `arr` * `-20(s0)` stored the `i` variable * `-24(s0)` stored the `lower` variable * `-32(s0)` stored the `higher` variable * `-28(s0)` stored the `mask` variable * Register allocation * `a0` stores the base address of `arr` * `a4` stores the value of `arr[i]` variable * `a5` stores the `i` variable * **The computed temporary results in memory are frequently stored back to stack** * Number of registers used: `3` * `ax` registers: `a0` `a4` `a5` * `tx` registers: None * Number of `lw` and `sw`: `31` * 19 `lw` * 12 `sw` * **Notice the arguments are stored on stack and frequently accessed.** * Number of branch/jump instructions: `3` * The for loop in C code is compiled into one `j` and one `bge` instruction * **Without any compiler optimization, the assembly code follows the logic flow of the C implementation** ```c=1 00010054 <singleNumber>: 10054: fd010113 addi sp,sp,-48 10058: 02812623 sw s0,44(sp) 1005c: 03010413 addi s0,sp,48 10060: fca42e23 sw a0,-36(s0) 10064: fe042623 sw zero,-20(s0) 10068: fe042423 sw zero,-24(s0) 1006c: fe042023 sw zero,-32(s0) 10070: fe042223 sw zero,-28(s0) 10074: 08c0006f j 10100 <singleNumber+0xac> 10078: fe442783 lw a5,-28(s0) 1007c: 00279793 slli a5,a5,0x2 10080: fdc42703 lw a4,-36(s0) 10084: 00f707b3 add a5,a4,a5 10088: 0007a703 lw a4,0(a5) 1008c: fec42783 lw a5,-20(s0) 10090: 00f777b3 and a5,a4,a5 10094: fe842703 lw a4,-24(s0) 10098: 00f747b3 xor a5,a4,a5 1009c: fef42423 sw a5,-24(s0) 100a0: fe442783 lw a5,-28(s0) 100a4: 00279793 slli a5,a5,0x2 100a8: fdc42703 lw a4,-36(s0) 100ac: 00f707b3 add a5,a4,a5 100b0: 0007a783 lw a5,0(a5) 100b4: fec42703 lw a4,-20(s0) 100b8: 00f747b3 xor a5,a4,a5 100bc: fef42623 sw a5,-20(s0) 100c0: fec42703 lw a4,-20(s0) 100c4: fe842783 lw a5,-24(s0) 100c8: 00f777b3 and a5,a4,a5 100cc: fff7c793 not a5,a5 100d0: fef42023 sw a5,-32(s0) 100d4: fe842703 lw a4,-24(s0) 100d8: fe042783 lw a5,-32(s0) 100dc: 00f777b3 and a5,a4,a5 100e0: fef42423 sw a5,-24(s0) 100e4: fec42703 lw a4,-20(s0) 100e8: fe042783 lw a5,-32(s0) 100ec: 00f777b3 and a5,a4,a5 100f0: fef42623 sw a5,-20(s0) 100f4: fe442783 lw a5,-28(s0) 100f8: 00178793 addi a5,a5,1 100fc: fef42223 sw a5,-28(s0) 10100: fe442703 lw a4,-28(s0) 10104: 00900793 li a5,9 10108: f6e7d8e3 bge a5,a4,10078 <singleNumber+0x24> 1010c: fec42783 lw a5,-20(s0) 10110: 00078513 mv a0,a5 10114: 02c12403 lw s0,44(sp) 10118: 03010113 addi sp,sp,48 1011c: 00008067 ret ``` ### -O1 * Observation * LOC (Line Of Code): `15` * Allocate `0` bytes on stack * Register allocation * `a0` stores the `lower` variable * `a1` stores the address of the last element of `arr` * `a2` stores the value of `arr[i]` * `a3` stores the `higher` variable * `a4` stores the address of `arr[i]` * `a5` stores the intermediate value * Number of registers used: `6` * `ax` registers: `a0` to `a5` * `tx` registers: None * Number of `lw` and `sw`: `1` * 1 `lw` * 0 `sw` * Number of jump/branch instructions: `3` ```c=1 00010054 <singleNumber>: 10054: 00050713 mv a4,a0 10058: 02850593 addi a1,a0,40 1005c: 00000693 li a3,0 10060: 00000513 li a0,0 10064: 00072603 lw a2,0(a4) 10068: 00a677b3 and a5,a2,a0 1006c: 00d7c7b3 xor a5,a5,a3 10070: 00a64533 xor a0,a2,a0 10074: fff54693 not a3,a0 10078: 00f6f6b3 and a3,a3,a5 1007c: fff7c793 not a5,a5 10080: 00a7f533 and a0,a5,a0 10084: 00470713 addi a4,a4,4 10088: fcb71ee3 bne a4,a1,10064 <singleNumber+0x10> 1008c: 00008067 ret ``` ### -O2 * Observation * LOC (Line Of Code): `15` * Allocate `0` bytes on stack * Register allocation * `a0` stores the address of `arr[i]` * `a1` stores the address of the last element of `arr` * `a2` stores the intermediate value * `a3` stores the value of `arr[i]` * `a4` stores the `higher` variable * `a5` stores the `lower` variable * **-O1 and -O2 only differs in the naming of registers** * Number of registers used: `6` * `ax` registers: `a0` to `a5` * `tx` registers: None * Number of `lw` and `sw`: `1` * 1 `lw` * 0 `sw` * Number of jump/branch instructions: `2` ```c=1 00010054 <singleNumber>: 10054: 02850593 addi a1,a0,40 10058: 00000713 li a4,0 1005c: 00000793 li a5,0 10060: 00052683 lw a3,0(a0) 10064: 00450513 addi a0,a0,4 10068: 00f6f633 and a2,a3,a5 1006c: 00e64733 xor a4,a2,a4 10070: 00f6c7b3 xor a5,a3,a5 10074: fff7c613 not a2,a5 10078: fff74693 not a3,a4 1007c: 00f6f7b3 and a5,a3,a5 10080: 00e67733 and a4,a2,a4 10084: fca59ee3 bne a1,a0,10060 <singleNumber+0xc> 10088: 00078513 mv a0,a5 1008c: 00008067 ret ``` ### -O3 * Observation * LOC (Line Of Code): `71` * Allocate `0` bytes on stack * Register allocation * The register allocation rules are too complicated to interpret by human * **The compiler "loop unroll" for loop to save avoid branch misprediction at the cost of larger code size** * Number of registers used: `6` * `ax` registers: `a0` to `a5` * `tx` registers: None * Number of `lw` and `sw`: `10` * 10 `lw` * 0 `sw` * Number of jump/branch instructions: `1` ```c=1 00010054 <singleNumber>: 10054: 00452783 lw a5,4(a0) 10058: 00052683 lw a3,0(a0) 1005c: 00852703 lw a4,8(a0) 10060: 00c52583 lw a1,12(a0) 10064: 00f6f633 and a2,a3,a5 10068: 00f6c6b3 xor a3,a3,a5 1006c: fff64793 not a5,a2 10070: 00d7f7b3 and a5,a5,a3 10074: fff6c693 not a3,a3 10078: 00c6f6b3 and a3,a3,a2 1007c: 00e7f633 and a2,a5,a4 10080: 00d646b3 xor a3,a2,a3 10084: 00e7c7b3 xor a5,a5,a4 10088: fff6c713 not a4,a3 1008c: 00f77733 and a4,a4,a5 10090: fff7c793 not a5,a5 10094: 00d7f7b3 and a5,a5,a3 10098: 00b776b3 and a3,a4,a1 1009c: 01052603 lw a2,16(a0) 100a0: 00f6c6b3 xor a3,a3,a5 100a4: 00b74733 xor a4,a4,a1 100a8: fff6c793 not a5,a3 100ac: 00e7f7b3 and a5,a5,a4 100b0: fff74713 not a4,a4 100b4: 00d77733 and a4,a4,a3 100b8: 00c7f6b3 and a3,a5,a2 100bc: 01452583 lw a1,20(a0) 100c0: 00e6c6b3 xor a3,a3,a4 100c4: 00c7c7b3 xor a5,a5,a2 100c8: fff6c713 not a4,a3 100cc: 00f77733 and a4,a4,a5 100d0: fff7c793 not a5,a5 100d4: 00d7f7b3 and a5,a5,a3 100d8: 00b776b3 and a3,a4,a1 100dc: 01852603 lw a2,24(a0) 100e0: 00f6c6b3 xor a3,a3,a5 100e4: 00b74733 xor a4,a4,a1 100e8: fff6c793 not a5,a3 100ec: 00e7f7b3 and a5,a5,a4 100f0: fff74713 not a4,a4 100f4: 00d77733 and a4,a4,a3 100f8: 00c7f6b3 and a3,a5,a2 100fc: 01c52583 lw a1,28(a0) 10100: 00e6c6b3 xor a3,a3,a4 10104: 00c7c7b3 xor a5,a5,a2 10108: fff6c713 not a4,a3 1010c: 00f77733 and a4,a4,a5 10110: fff7c793 not a5,a5 10114: 00d7f7b3 and a5,a5,a3 10118: 00b776b3 and a3,a4,a1 1011c: 02052603 lw a2,32(a0) 10120: 00f6c6b3 xor a3,a3,a5 10124: 00b74733 xor a4,a4,a1 10128: fff6c793 not a5,a3 1012c: 00e7f7b3 and a5,a5,a4 10130: fff74713 not a4,a4 10134: 00d77733 and a4,a4,a3 10138: 00c7f6b3 and a3,a5,a2 1013c: 00e6c733 xor a4,a3,a4 10140: 02452683 lw a3,36(a0) 10144: 00c7c7b3 xor a5,a5,a2 10148: fff74513 not a0,a4 1014c: 00f57633 and a2,a0,a5 10150: fff7c793 not a5,a5 10154: 00c6f533 and a0,a3,a2 10158: 00e7f7b3 and a5,a5,a4 1015c: 00f547b3 xor a5,a0,a5 10160: fff7c793 not a5,a5 10164: 00c6c533 xor a0,a3,a2 10168: 00a7f533 and a0,a5,a0 1016c: 00008067 ret ``` ### Summary: Observed Compiler Behavior :::info * From `-O0` to `-O1` * Compensates the additional usage of 3 registers with significantly less lw/sw count and no stack space * From `-O1` to `-O2` * Nearly identical * From `-O2` to `-O3` * Loop unroll to avoid branch misprediction at the cost to code size ::: | | -O0 | -O1 | -O2 | -O3 | |--------------------|-----------|-----------|-----------|-----------| | Line of Code | 51 | 15 | 15 | 71 | | Register used | 3 | 6 | 6 | 6 | | Stack used (bytes) | 48 | 0 | 0 | 0 | | lw/sw count | 31 | 1 | 1 | 10 | | loop unrolling | N | N | N | Y | ## Statistics from emulator ### -O0 ```c=1 >>> Execution time: 14998 ns >>> Instruction count: 442 (IPS=29470596) >>> Jumps: 14 (3.17%) - 2 forwards, 12 backwards >>> Branching T=10 (83.33%) F=2 (16.67%) ``` ### -O1 ```c=1 >>> Execution time: 7662 ns >>> Instruction count: 141 (IPS=18402505) >>> Jumps: 12 (8.51%) - 1 forwards, 11 backwards >>> Branching T=9 (81.82%) F=2 (18.18%) ``` ### -O2 ```c=1 >>> Execution time: 7801 ns >>> Instruction count: 136 (IPS=17433662) >>> Jumps: 10 (7.35%) - 0 forwards, 10 backwards >>> Branching T=9 (81.82%) F=2 (18.18%) ``` ### -O3 ```c=1 >>> Execution time: 6571 ns >>> Instruction count: 82 (IPS=12479074) >>> Jumps: 1 (1.22%) - 0 forwards, 1 backwards >>> Branching T=0 (0.00%) F=1 (100.00%) ``` ![](https://i.imgur.com/I3L886X.png) ## Beating gcc compiler in speed ? in size ? TBD