# Assignment2: RISC-V Toolchain Contributed by [kc71486](https://github.com/kc71486) ###### tags: `RISC-V`, `jserv` ## Install riscv rv32emu toolchain Follow the instruction from [this note](https://hackmd.io/@sysprog/rJAufgHYS)<br> Ignore `Configure $PATH` part, and add riscv-none-elf-gcc and rv32emu into `~/.bashrc` instead. ```bash=120 # add riscv-none-elf-gcc into PATH export PATH="$PATH:~/riscv-none-elf-gcc/bin" # add rv32emu into PATH export PATH="$PATH:~/rv32emu/build" ``` Then `source` the .bashrc file ```bash source .bashrc ``` ## Pick a Program The following question is picked from the Assignment 1 > 唐飴苹 Calculate the Hamming Distance using Counting Leading Zeros > The Hamming Distance between two integers is defined as the number of differing bits at the same position when comparing the binary representations of the integers. For example, the Hamming Distance between 1011101 and 1001001 is 2. > In the assignment, I implement the program to calculate the Hamming Distance between the two given 64-bit unsigned integers. [Source code](https://github.com/yptang5488/Computer-Architecture) The original implementation of the questions is as follows. ```c= int32_t HammingDistance_c(uint64_t x0, uint64_t x1) { int Hdist = 0; int16_t max_digit = 64 - (int16_t)count_leading_zeros((x0 > x1)? x0 : x1); while(max_digit > 0){ uint64_t c1 = x0 & 1; uint64_t c2 = x1 & 1; if(c1 != c2) Hdist += 1; x0 = x0 >> 1; x1 = x1 >> 1; max_digit -= 1; } return Hdist; } ``` There is an easier way to achieve this, but since it is probably not how we should do the assignment, I am not doing it. Easier way: ```c int32_t HammingDistance_c(uint64_t x0, uint64_t x1) { int64_t diff = x0 ^ x1; return population_count(diff); } ``` The reason I choose this question is that hamming code is used in error correcting code, which is widely used in main memory, especially in servers, where reliability is very important. ### Compile / excecute the code To compile the .c file, I wrote makefile. :::spoiler makefile ``` .PHONY: clean include toolchain.mk ASFLAGS = -march=rv32i_zicsr -mabi=ilp32 CFLAGS = -O0 -Wall -Wextra OBJS = \ getcycles.o \ main.o \ hammingc_v0.o \ hammings_v0.o BIN = hammingdistance.elf %.o: %.s riscv-none-elf-gcc $(ASFLAGS) -c -o $@ $< %.o: %.c riscv-none-elf-gcc $(ASFLAGS) $(CFLAGS) -c -o $@ $< all: $(BIN) $(BIN): $(OBJS) riscv-none-elf-gcc -o $@ $^ ``` ::: And use `make` to compile.<br> There are some modification I need to make: * Modify assembly flag (include csr) to get current cycle. * Link with riscv-none-elf-gcc (instead of riscv-none-elf-ld) in order to use standard library such as printf function. * Change CFLAGS according to the optimization level For every test, I run the shell that includes: ``` riscv-none-elf-readelf -h hammingdistance.elf echo "" riscv-none-elf-size hammingdistance.elf echo "" rv32emu hammingdistance.elf ``` #### gcc O0 optimization > dec:56266, cycle:7517, result: (correct) O0 is the baseline of everything. Nothing special other than large program size. :::spoiler Disassembled ``` 00010220 <count_leading_zeros>: 10220: f7010113 add sp,sp,-144 10224: 08812623 sw s0,140(sp) 10228: 09212423 sw s2,136(sp) 1022c: 09312223 sw s3,132(sp) 10230: 09412023 sw s4,128(sp) 10234: 07512e23 sw s5,124(sp) 10238: 07612c23 sw s6,120(sp) 1023c: 07712a23 sw s7,116(sp) 10240: 07812823 sw s8,112(sp) 10244: 07912623 sw s9,108(sp) 10248: 07a12423 sw s10,104(sp) 1024c: 07b12223 sw s11,100(sp) 10250: 09010413 add s0,sp,144 10254: fca42423 sw a0,-56(s0) 10258: fcb42623 sw a1,-52(s0) 1025c: fcc42783 lw a5,-52(s0) 10260: 01f79713 sll a4,a5,0x1f 10264: fc842783 lw a5,-56(s0) 10268: 0017d613 srl a2,a5,0x1 1026c: 00c76633 or a2,a4,a2 10270: fcc42783 lw a5,-52(s0) 10274: 0017d693 srl a3,a5,0x1 10278: fc842783 lw a5,-56(s0) 1027c: 00c7ee33 or t3,a5,a2 10280: fcc42783 lw a5,-52(s0) 10284: 00d7eeb3 or t4,a5,a3 10288: fdc42423 sw t3,-56(s0) 1028c: fdd42623 sw t4,-52(s0) 10290: fcc42783 lw a5,-52(s0) 10294: 01e79713 sll a4,a5,0x1e 10298: fc842783 lw a5,-56(s0) 1029c: 0027d813 srl a6,a5,0x2 102a0: 01076833 or a6,a4,a6 102a4: fcc42783 lw a5,-52(s0) 102a8: 0027d893 srl a7,a5,0x2 102ac: fc842783 lw a5,-56(s0) 102b0: 0107e7b3 or a5,a5,a6 102b4: f8f42823 sw a5,-112(s0) 102b8: fcc42783 lw a5,-52(s0) 102bc: 0117e7b3 or a5,a5,a7 102c0: f8f42a23 sw a5,-108(s0) 102c4: f9042783 lw a5,-112(s0) 102c8: f9442803 lw a6,-108(s0) 102cc: fcf42423 sw a5,-56(s0) 102d0: fd042623 sw a6,-52(s0) 102d4: fcc42783 lw a5,-52(s0) 102d8: 01c79713 sll a4,a5,0x1c 102dc: fc842783 lw a5,-56(s0) 102e0: 0047d313 srl t1,a5,0x4 102e4: 00676333 or t1,a4,t1 102e8: fcc42783 lw a5,-52(s0) 102ec: 0047d393 srl t2,a5,0x4 102f0: fc842783 lw a5,-56(s0) 102f4: 0067e7b3 or a5,a5,t1 102f8: f8f42423 sw a5,-120(s0) 102fc: fcc42783 lw a5,-52(s0) 10300: 0077e7b3 or a5,a5,t2 10304: f8f42623 sw a5,-116(s0) 10308: f8842783 lw a5,-120(s0) 1030c: f8c42803 lw a6,-116(s0) 10310: fcf42423 sw a5,-56(s0) 10314: fd042623 sw a6,-52(s0) 10318: fcc42783 lw a5,-52(s0) 1031c: 01879713 sll a4,a5,0x18 10320: fc842783 lw a5,-56(s0) 10324: 0087d913 srl s2,a5,0x8 10328: 01276933 or s2,a4,s2 1032c: fcc42783 lw a5,-52(s0) 10330: 0087d993 srl s3,a5,0x8 10334: fc842783 lw a5,-56(s0) 10338: 0127e7b3 or a5,a5,s2 1033c: f8f42023 sw a5,-128(s0) 10340: fcc42783 lw a5,-52(s0) 10344: 0137e7b3 or a5,a5,s3 10348: f8f42223 sw a5,-124(s0) 1034c: f8042783 lw a5,-128(s0) 10350: f8442803 lw a6,-124(s0) 10354: fcf42423 sw a5,-56(s0) 10358: fd042623 sw a6,-52(s0) 1035c: fcc42783 lw a5,-52(s0) 10360: 01079713 sll a4,a5,0x10 10364: fc842783 lw a5,-56(s0) 10368: 0107d793 srl a5,a5,0x10 1036c: fcf42023 sw a5,-64(s0) 10370: fc042783 lw a5,-64(s0) 10374: 00f767b3 or a5,a4,a5 10378: fcf42023 sw a5,-64(s0) 1037c: fcc42783 lw a5,-52(s0) 10380: 0107d793 srl a5,a5,0x10 10384: fcf42223 sw a5,-60(s0) 10388: fc842783 lw a5,-56(s0) 1038c: fc042603 lw a2,-64(s0) 10390: fc442683 lw a3,-60(s0) 10394: 00060713 mv a4,a2 10398: 00e7e7b3 or a5,a5,a4 1039c: f6f42c23 sw a5,-136(s0) 103a0: fcc42783 lw a5,-52(s0) 103a4: 00068713 mv a4,a3 103a8: 00e7e7b3 or a5,a5,a4 103ac: f6f42e23 sw a5,-132(s0) 103b0: f7842783 lw a5,-136(s0) 103b4: f7c42803 lw a6,-132(s0) 103b8: fcf42423 sw a5,-56(s0) 103bc: fd042623 sw a6,-52(s0) 103c0: fcc42783 lw a5,-52(s0) 103c4: 0007d793 srl a5,a5,0x0 103c8: faf42c23 sw a5,-72(s0) 103cc: fa042e23 sw zero,-68(s0) 103d0: fc842783 lw a5,-56(s0) 103d4: fb842603 lw a2,-72(s0) 103d8: fbc42683 lw a3,-68(s0) 103dc: 00060713 mv a4,a2 103e0: 00e7e7b3 or a5,a5,a4 103e4: f6f42823 sw a5,-144(s0) 103e8: fcc42783 lw a5,-52(s0) 103ec: 00068713 mv a4,a3 103f0: 00e7e7b3 or a5,a5,a4 103f4: f6f42a23 sw a5,-140(s0) 103f8: f7042783 lw a5,-144(s0) 103fc: f7442803 lw a6,-140(s0) 10400: fcf42423 sw a5,-56(s0) 10404: fd042623 sw a6,-52(s0) 10408: fcc42783 lw a5,-52(s0) 1040c: 01f79793 sll a5,a5,0x1f 10410: fc842703 lw a4,-56(s0) 10414: 00175d13 srl s10,a4,0x1 10418: 01a7ed33 or s10,a5,s10 1041c: fcc42783 lw a5,-52(s0) 10420: 0017dd93 srl s11,a5,0x1 10424: 555557b7 lui a5,0x55555 10428: 55578793 add a5,a5,1365 # 55555555 <__BSS_END__+0x55536691> 1042c: 00fd77b3 and a5,s10,a5 10430: faf42823 sw a5,-80(s0) 10434: 555557b7 lui a5,0x55555 10438: 55578793 add a5,a5,1365 # 55555555 <__BSS_END__+0x55536691> 1043c: 00fdf7b3 and a5,s11,a5 10440: faf42a23 sw a5,-76(s0) 10444: fc842603 lw a2,-56(s0) 10448: fcc42683 lw a3,-52(s0) 1044c: fb042803 lw a6,-80(s0) 10450: fb442883 lw a7,-76(s0) 10454: 00080593 mv a1,a6 10458: 40b60733 sub a4,a2,a1 1045c: 00070593 mv a1,a4 10460: 00b635b3 sltu a1,a2,a1 10464: 00088513 mv a0,a7 10468: 40a687b3 sub a5,a3,a0 1046c: 40b786b3 sub a3,a5,a1 10470: 00068793 mv a5,a3 10474: fce42423 sw a4,-56(s0) 10478: fcf42623 sw a5,-52(s0) 1047c: fcc42783 lw a5,-52(s0) 10480: 01e79793 sll a5,a5,0x1e 10484: fc842703 lw a4,-56(s0) 10488: 00275c13 srl s8,a4,0x2 1048c: 0187ec33 or s8,a5,s8 10490: fcc42783 lw a5,-52(s0) 10494: 0027dc93 srl s9,a5,0x2 10498: 333337b7 lui a5,0x33333 1049c: 33378793 add a5,a5,819 # 33333333 <__BSS_END__+0x3331446f> 104a0: 00fc77b3 and a5,s8,a5 104a4: faf42423 sw a5,-88(s0) 104a8: 333337b7 lui a5,0x33333 104ac: 33378793 add a5,a5,819 # 33333333 <__BSS_END__+0x3331446f> 104b0: 00fcf7b3 and a5,s9,a5 104b4: faf42623 sw a5,-84(s0) 104b8: fc842703 lw a4,-56(s0) 104bc: 333337b7 lui a5,0x33333 104c0: 33378793 add a5,a5,819 # 33333333 <__BSS_END__+0x3331446f> 104c4: 00f777b3 and a5,a4,a5 104c8: faf42023 sw a5,-96(s0) 104cc: fcc42703 lw a4,-52(s0) 104d0: 333337b7 lui a5,0x33333 104d4: 33378793 add a5,a5,819 # 33333333 <__BSS_END__+0x3331446f> 104d8: 00f777b3 and a5,a4,a5 104dc: faf42223 sw a5,-92(s0) 104e0: fa842503 lw a0,-88(s0) 104e4: fac42583 lw a1,-84(s0) 104e8: 00050693 mv a3,a0 104ec: fa042803 lw a6,-96(s0) 104f0: fa442883 lw a7,-92(s0) 104f4: 00080613 mv a2,a6 104f8: 00c68733 add a4,a3,a2 104fc: 00070693 mv a3,a4 10500: 00050613 mv a2,a0 10504: 00c6b6b3 sltu a3,a3,a2 10508: 00058613 mv a2,a1 1050c: 00088593 mv a1,a7 10510: 00b607b3 add a5,a2,a1 10514: 00f686b3 add a3,a3,a5 10518: 00068793 mv a5,a3 1051c: fce42423 sw a4,-56(s0) 10520: fcf42623 sw a5,-52(s0) 10524: fcc42783 lw a5,-52(s0) 10528: 01c79793 sll a5,a5,0x1c 1052c: fc842703 lw a4,-56(s0) 10530: 00475f13 srl t5,a4,0x4 10534: 01e7ef33 or t5,a5,t5 10538: fcc42783 lw a5,-52(s0) 1053c: 0047df93 srl t6,a5,0x4 10540: fc842603 lw a2,-56(s0) 10544: fcc42683 lw a3,-52(s0) 10548: 00cf0733 add a4,t5,a2 1054c: 00070593 mv a1,a4 10550: 01e5b5b3 sltu a1,a1,t5 10554: 00df87b3 add a5,t6,a3 10558: 00f586b3 add a3,a1,a5 1055c: 00068793 mv a5,a3 10560: 0f0f16b7 lui a3,0xf0f1 10564: f0f68693 add a3,a3,-241 # f0f0f0f <__BSS_END__+0xf0d204b> 10568: 00d776b3 and a3,a4,a3 1056c: fcd42423 sw a3,-56(s0) 10570: 0f0f16b7 lui a3,0xf0f1 10574: f0f68693 add a3,a3,-241 # f0f0f0f <__BSS_END__+0xf0d204b> 10578: 00d7f7b3 and a5,a5,a3 1057c: fcf42623 sw a5,-52(s0) 10580: fcc42783 lw a5,-52(s0) 10584: 01879793 sll a5,a5,0x18 10588: fc842703 lw a4,-56(s0) 1058c: 00875b13 srl s6,a4,0x8 10590: 0167eb33 or s6,a5,s6 10594: fcc42783 lw a5,-52(s0) 10598: 0087db93 srl s7,a5,0x8 1059c: fc842603 lw a2,-56(s0) 105a0: fcc42683 lw a3,-52(s0) 105a4: 01660733 add a4,a2,s6 105a8: 00070593 mv a1,a4 105ac: 00c5b5b3 sltu a1,a1,a2 105b0: 017687b3 add a5,a3,s7 105b4: 00f586b3 add a3,a1,a5 105b8: 00068793 mv a5,a3 105bc: fce42423 sw a4,-56(s0) 105c0: fcf42623 sw a5,-52(s0) 105c4: fcc42783 lw a5,-52(s0) 105c8: 01079793 sll a5,a5,0x10 105cc: fc842703 lw a4,-56(s0) 105d0: 01075a13 srl s4,a4,0x10 105d4: 0147ea33 or s4,a5,s4 105d8: fcc42783 lw a5,-52(s0) 105dc: 0107da93 srl s5,a5,0x10 105e0: fc842603 lw a2,-56(s0) 105e4: fcc42683 lw a3,-52(s0) 105e8: 01460733 add a4,a2,s4 105ec: 00070593 mv a1,a4 105f0: 00c5b5b3 sltu a1,a1,a2 105f4: 015687b3 add a5,a3,s5 105f8: 00f586b3 add a3,a1,a5 105fc: 00068793 mv a5,a3 10600: fce42423 sw a4,-56(s0) 10604: fcf42623 sw a5,-52(s0) 10608: fcc42783 lw a5,-52(s0) 1060c: 0007d793 srl a5,a5,0x0 10610: f8f42c23 sw a5,-104(s0) 10614: f8042e23 sw zero,-100(s0) 10618: fc842603 lw a2,-56(s0) 1061c: fcc42683 lw a3,-52(s0) 10620: f9842803 lw a6,-104(s0) 10624: f9c42883 lw a7,-100(s0) 10628: 00080593 mv a1,a6 1062c: 00b60733 add a4,a2,a1 10630: 00070593 mv a1,a4 10634: 00c5b5b3 sltu a1,a1,a2 10638: 00088513 mv a0,a7 1063c: 00a687b3 add a5,a3,a0 10640: 00f586b3 add a3,a1,a5 10644: 00068793 mv a5,a3 10648: fce42423 sw a4,-56(s0) 1064c: fcf42623 sw a5,-52(s0) 10650: fc845783 lhu a5,-56(s0) 10654: 07f7f793 and a5,a5,127 10658: 01079793 sll a5,a5,0x10 1065c: 0107d793 srl a5,a5,0x10 10660: 04000713 li a4,64 10664: 40f707b3 sub a5,a4,a5 10668: 01079793 sll a5,a5,0x10 1066c: 0107d793 srl a5,a5,0x10 10670: 00078513 mv a0,a5 10674: 08c12403 lw s0,140(sp) 10678: 08812903 lw s2,136(sp) 1067c: 08412983 lw s3,132(sp) 10680: 08012a03 lw s4,128(sp) 10684: 07c12a83 lw s5,124(sp) 10688: 07812b03 lw s6,120(sp) 1068c: 07412b83 lw s7,116(sp) 10690: 07012c03 lw s8,112(sp) 10694: 06c12c83 lw s9,108(sp) 10698: 06812d03 lw s10,104(sp) 1069c: 06412d83 lw s11,100(sp) 106a0: 09010113 add sp,sp,144 106a4: 00008067 ret 000106a8 <HammingDistance_c>: 106a8: fa010113 add sp,sp,-96 106ac: 04112e23 sw ra,92(sp) 106b0: 04812c23 sw s0,88(sp) 106b4: 05212a23 sw s2,84(sp) 106b8: 05312823 sw s3,80(sp) 106bc: 05412623 sw s4,76(sp) 106c0: 05512423 sw s5,72(sp) 106c4: 05612223 sw s6,68(sp) 106c8: 05712023 sw s7,64(sp) 106cc: 03812e23 sw s8,60(sp) 106d0: 03912c23 sw s9,56(sp) 106d4: 06010413 add s0,sp,96 106d8: faa42423 sw a0,-88(s0) 106dc: fab42623 sw a1,-84(s0) 106e0: fac42023 sw a2,-96(s0) 106e4: fad42223 sw a3,-92(s0) 106e8: fc042623 sw zero,-52(s0) 106ec: fa842603 lw a2,-88(s0) 106f0: fac42683 lw a3,-84(s0) 106f4: fa042703 lw a4,-96(s0) 106f8: fa442783 lw a5,-92(s0) 106fc: 00068513 mv a0,a3 10700: 00078593 mv a1,a5 10704: 00a5ee63 bltu a1,a0,10720 <HammingDistance_c+0x78> 10708: 00068513 mv a0,a3 1070c: 00078593 mv a1,a5 10710: 00b51c63 bne a0,a1,10728 <HammingDistance_c+0x80> 10714: 00060513 mv a0,a2 10718: 00070593 mv a1,a4 1071c: 00a5f663 bgeu a1,a0,10728 <HammingDistance_c+0x80> 10720: 00060713 mv a4,a2 10724: 00068793 mv a5,a3 10728: 00070513 mv a0,a4 1072c: 00078593 mv a1,a5 10730: af1ff0ef jal 10220 <count_leading_zeros> 10734: 00050793 mv a5,a0 10738: 00078713 mv a4,a5 1073c: 04000793 li a5,64 10740: 40e787b3 sub a5,a5,a4 10744: 01079793 sll a5,a5,0x10 10748: 0107d793 srl a5,a5,0x10 1074c: fcf41523 sh a5,-54(s0) 10750: 0b40006f j 10804 <HammingDistance_c+0x15c> 10754: fa842783 lw a5,-88(s0) 10758: 0017fb13 and s6,a5,1 1075c: fac42783 lw a5,-84(s0) 10760: 0007fb93 and s7,a5,0 10764: fd642023 sw s6,-64(s0) 10768: fd742223 sw s7,-60(s0) 1076c: fa042783 lw a5,-96(s0) 10770: 0017fc13 and s8,a5,1 10774: fa442783 lw a5,-92(s0) 10778: 0007fc93 and s9,a5,0 1077c: fb842c23 sw s8,-72(s0) 10780: fb942e23 sw s9,-68(s0) 10784: fc042703 lw a4,-64(s0) 10788: fb842783 lw a5,-72(s0) 1078c: 00f71863 bne a4,a5,1079c <HammingDistance_c+0xf4> 10790: fc442703 lw a4,-60(s0) 10794: fbc42783 lw a5,-68(s0) 10798: 00f70863 beq a4,a5,107a8 <HammingDistance_c+0x100> 1079c: fcc42783 lw a5,-52(s0) 107a0: 00178793 add a5,a5,1 107a4: fcf42623 sw a5,-52(s0) 107a8: fac42783 lw a5,-84(s0) 107ac: 01f79793 sll a5,a5,0x1f 107b0: fa842703 lw a4,-88(s0) 107b4: 00175913 srl s2,a4,0x1 107b8: 0127e933 or s2,a5,s2 107bc: fac42783 lw a5,-84(s0) 107c0: 0017d993 srl s3,a5,0x1 107c4: fb242423 sw s2,-88(s0) 107c8: fb342623 sw s3,-84(s0) 107cc: fa442783 lw a5,-92(s0) 107d0: 01f79793 sll a5,a5,0x1f 107d4: fa042703 lw a4,-96(s0) 107d8: 00175a13 srl s4,a4,0x1 107dc: 0147ea33 or s4,a5,s4 107e0: fa442783 lw a5,-92(s0) 107e4: 0017da93 srl s5,a5,0x1 107e8: fb442023 sw s4,-96(s0) 107ec: fb542223 sw s5,-92(s0) 107f0: fca45783 lhu a5,-54(s0) 107f4: fff78793 add a5,a5,-1 107f8: 01079793 sll a5,a5,0x10 107fc: 0107d793 srl a5,a5,0x10 10800: fcf41523 sh a5,-54(s0) 10804: fca41783 lh a5,-54(s0) 10808: f4f046e3 bgtz a5,10754 <HammingDistance_c+0xac> 1080c: fcc42783 lw a5,-52(s0) 10810: 00078513 mv a0,a5 10814: 05c12083 lw ra,92(sp) 10818: 05812403 lw s0,88(sp) 1081c: 05412903 lw s2,84(sp) 10820: 05012983 lw s3,80(sp) 10824: 04c12a03 lw s4,76(sp) 10828: 04812a83 lw s5,72(sp) 1082c: 04412b03 lw s6,68(sp) 10830: 04012b83 lw s7,64(sp) 10834: 03c12c03 lw s8,60(sp) 10838: 03812c83 lw s9,56(sp) 1083c: 06010113 add sp,sp,96 10840: 00008067 ret ``` ::: A lot of load/store, no wonder it is so slow. #### gcc O1 optimization > dec:55106, cycle:2562, result: (correct) Much faster than O0. Also smaller code size. :::spoiler Disassembled ``` 0001021c <HammingDistance_c>: 1021c: 00050713 mv a4,a0 10220: 00060813 mv a6,a2 10224: 00068893 mv a7,a3 10228: 00b6e663 bltu a3,a1,10234 <HammingDistance_c+0x18> 1022c: 00d59863 bne a1,a3,1023c <HammingDistance_c+0x20> 10230: 00a67663 bgeu a2,a0,1023c <HammingDistance_c+0x20> 10234: 00070813 mv a6,a4 10238: 00058893 mv a7,a1 1023c: 01f89793 sll a5,a7,0x1f 10240: 00185513 srl a0,a6,0x1 10244: 00a7e533 or a0,a5,a0 10248: 0018d793 srl a5,a7,0x1 1024c: 01056833 or a6,a0,a6 10250: 0117e7b3 or a5,a5,a7 10254: 01e79893 sll a7,a5,0x1e 10258: 00285513 srl a0,a6,0x2 1025c: 00a8e533 or a0,a7,a0 10260: 0027d893 srl a7,a5,0x2 10264: 00a86833 or a6,a6,a0 10268: 0117e7b3 or a5,a5,a7 1026c: 01c79893 sll a7,a5,0x1c 10270: 00485513 srl a0,a6,0x4 10274: 00a8e533 or a0,a7,a0 10278: 0047d893 srl a7,a5,0x4 1027c: 00a86833 or a6,a6,a0 10280: 0117e7b3 or a5,a5,a7 10284: 01879893 sll a7,a5,0x18 10288: 00885513 srl a0,a6,0x8 1028c: 00a8e533 or a0,a7,a0 10290: 0087d893 srl a7,a5,0x8 10294: 00a86833 or a6,a6,a0 10298: 0117e7b3 or a5,a5,a7 1029c: 01079893 sll a7,a5,0x10 102a0: 01085513 srl a0,a6,0x10 102a4: 00a8e533 or a0,a7,a0 102a8: 0107d893 srl a7,a5,0x10 102ac: 00a86833 or a6,a6,a0 102b0: 0117e7b3 or a5,a5,a7 102b4: 00f86833 or a6,a6,a5 102b8: 01f79893 sll a7,a5,0x1f 102bc: 00185513 srl a0,a6,0x1 102c0: 00a8e533 or a0,a7,a0 102c4: 0017d313 srl t1,a5,0x1 102c8: 555558b7 lui a7,0x55555 102cc: 55588893 add a7,a7,1365 # 55555555 <__BSS_END__+0x555377d1> 102d0: 01157533 and a0,a0,a7 102d4: 011378b3 and a7,t1,a7 102d8: 40a80533 sub a0,a6,a0 102dc: 00a83833 sltu a6,a6,a0 102e0: 411787b3 sub a5,a5,a7 102e4: 410787b3 sub a5,a5,a6 102e8: 01e79893 sll a7,a5,0x1e 102ec: 00255813 srl a6,a0,0x2 102f0: 0108e833 or a6,a7,a6 102f4: 0027d313 srl t1,a5,0x2 102f8: 333338b7 lui a7,0x33333 102fc: 33388893 add a7,a7,819 # 33333333 <__BSS_END__+0x333155af> 10300: 01187833 and a6,a6,a7 10304: 01137333 and t1,t1,a7 10308: 01157533 and a0,a0,a7 1030c: 0117f7b3 and a5,a5,a7 10310: 00a80533 add a0,a6,a0 10314: 01053833 sltu a6,a0,a6 10318: 00f307b3 add a5,t1,a5 1031c: 00f80833 add a6,a6,a5 10320: 01c81893 sll a7,a6,0x1c 10324: 00455793 srl a5,a0,0x4 10328: 00f8e7b3 or a5,a7,a5 1032c: 00485893 srl a7,a6,0x4 10330: 00a78533 add a0,a5,a0 10334: 00f537b3 sltu a5,a0,a5 10338: 01088833 add a6,a7,a6 1033c: 010787b3 add a5,a5,a6 10340: 0f0f1837 lui a6,0xf0f1 10344: f0f80813 add a6,a6,-241 # f0f0f0f <__BSS_END__+0xf0d318b> 10348: 01057533 and a0,a0,a6 1034c: 0107f7b3 and a5,a5,a6 10350: 01879893 sll a7,a5,0x18 10354: 00855813 srl a6,a0,0x8 10358: 0108e833 or a6,a7,a6 1035c: 0087d893 srl a7,a5,0x8 10360: 01050833 add a6,a0,a6 10364: 00a83533 sltu a0,a6,a0 10368: 011787b3 add a5,a5,a7 1036c: 00f50533 add a0,a0,a5 10370: 01051893 sll a7,a0,0x10 10374: 01085793 srl a5,a6,0x10 10378: 00f8e7b3 or a5,a7,a5 1037c: 01055893 srl a7,a0,0x10 10380: 00f807b3 add a5,a6,a5 10384: 0107b833 sltu a6,a5,a6 10388: 01150533 add a0,a0,a7 1038c: 00a80833 add a6,a6,a0 10390: 010787b3 add a5,a5,a6 10394: 07f7f793 and a5,a5,127 10398: 04f05463 blez a5,103e0 <HammingDistance_c+0x1c4> 1039c: 00000513 li a0,0 103a0: 00c74833 xor a6,a4,a2 103a4: 00187813 and a6,a6,1 103a8: 01050533 add a0,a0,a6 103ac: 01f59813 sll a6,a1,0x1f 103b0: 00175713 srl a4,a4,0x1 103b4: 00e86733 or a4,a6,a4 103b8: 0015d593 srl a1,a1,0x1 103bc: 01f69813 sll a6,a3,0x1f 103c0: 00165613 srl a2,a2,0x1 103c4: 00c86633 or a2,a6,a2 103c8: 0016d693 srl a3,a3,0x1 103cc: fff78793 add a5,a5,-1 103d0: 01079793 sll a5,a5,0x10 103d4: 4107d793 sra a5,a5,0x10 103d8: fc0794e3 bnez a5,103a0 <HammingDistance_c+0x184> 103dc: 00008067 ret 103e0: 00000513 li a0,0 103e4: 00008067 ret ``` ::: Almost all load/store has become register. count_leading_zero was inlined. A lot of `shift` and `or`, probably because 64 bit shift is tedious. #### gcc O2 optimization > dec:51666, cycle:2866, result: (correct) Suprisingly slower than O1. And bigger code size. Really counter intuitive. #### gcc -Os optimization > dec:55102, cycle:2798, result: (correct) Suprisingly faster than O2. And slightly smaller code size. #### gcc -Ofast optimization > dec:55122, cycle:2866, result: (correct) Same to O2, I feel like some O2 optimization flags make things worse. #### gcc -O3 optimization > dec:55122, cycle:2866, result: (correct) Same to O2, nothing special. #### original assembly > dec:53414, cycle:3198, result: (incorrect) Slower than O2 but faster than O0, bigger code size, and the result is wrong. :::info Modified version: test1_x0 = 0x0013000000100000; test1_x1 = 0x00000000000FFFFF; test2_x0 = 0x0000000200000001; test2_x1 = 0x7FFFFFFFFFFFFFFE; test3_x0 = 0x000000028370228F; test3_x1 = 0x000000028370228F; This testcase is more general (slight more random instead of consecutive bit difference). The assembly would pass the original testcase. ::: ## Optimize the program ### Fix the assembly mismatch (v2) Upon inspecting the assembly, I found that the workflow of the assembly is quite different from the corresponding c code, so The first thing I had to do was match the c code. #### max_digit issue There are several places that didn't follow the c code, such as: ```c= int32_t max_digit_x0 = 64 - count_leading_zeros(x0); int32_t max_digit_x1 = 64 - count_leading_zeros(x1); int32_t max_digit = (max_digit_x0 > max_digit_x1) ? max_digit_x0 : max_digit_x1; ``` instead of original: ```c= int16_t max_digit = 64 - (int16_t)count_leading_zeros((x0 > x1)? x0 : x1); ``` It doesn't lead to incorrect result, but additional `count_leading_zeros` function call does make assembly slower. #### c1 c2 issue And also c1 and c2 assignment is wrong, the assembly can be interpreted as ```c= uint32_t c1, c2; if(max_digit > 32) { c1 = (x0 >> 32) & 1; c2 = (x1 >> 32) & 1; } else { c1 = x0 & 1; c2 = x1 & 1; } ``` instead of intended: ```c= uint64_t c1 = x0 & 1; uint64_t c2 = x1 & 1; if(c1 != c2) Hdist += 1; ``` #### after fixing > dec:56826, cycle:2566, result: (correct) Slightly slower than -O1 optimization, indicating that the code is almost optimized. Code size is larger than O1, but that can be improved. ### Rewrite print function (v2) `printf` is bulky, and we don't actually need that much code size to perform general purpose print function. Therefore, I rewrited print function to improve so the code size will be much smaller.<br> It shouldn't affect cycle since cycle measurement doesn't include any print function. I replace ```c= #include <stdio.h> //... printf("Hamming Distance:%ld\n", d1); // 24 ``` with ```c extern void printstr(const char *, uint32_t); extern void printchar(char); extern void printint(int32_t); //... printstr("Hamming Distance:", 18); printint(d1); printchar('\n'); ``` which results in > dec:11216, cycle:2566, result: (correct) Unsuprisingly, much smaller code size. Cycle count didn't change because cycle measurement didn't include this part. ### assembly optimization (v3) #### counting_leading_zero optimization When dealing with int64 right shift, we need to perform three `shift` and one `or` operation, which is a lot. But since leading zero is either in higher word or lower word, we only need to do it once.<br> Depend on the higher bit, there are two situation: * Higher bit is zero, in this case only the lower word is important, therefore the answer is normal int32 count_leading_zero + 32. * Higher bit is not zero, in this case only the higher word is important, therefore the answer is normal int32 count_leading_zero. ```c= static int32_t count_leading_zero(uint32_t x, uint32_t y) { if(y != 0) { x = y; y = 32; } x |= (x >> 1); x |= (x >> 2); x |= (x >> 4); x |= (x >> 8); x |= (x >> 16); x |= (x >> 32); /* count ones (population count) */ x -= ((x >> 1) & 0x55555555); x = ((x >> 2) & 0x33333333) + (x & 0x33333333); x = ((x >> 4) + x) & 0x0f0f0f0f; x += (x >> 8); x += (x >> 16); return (64 - y - (x & 0x7f)); } ``` #### while loop optimization Another place where 64 bit shift can be simplified. Because the purpose of this loop is to traverse all digits, we can loop them seperately without two part interacting with each other. ```c= //int32_t HammingDistancev2_c(uint32_t x0, uint32_t y0, uint32_t x1, uint32_t y1); while(max_digit > 32) { Hdist += (y0 ^ y1) & 1; y0 = y0 >> 1; y1 = y1 >> 1; max_digit -= 1; } while(max_digit > 0) { Hdist += (x0 ^ x1) & 1; x0 = x0 >> 1; x1 = x1 >> 1; max_digit -= 1; } ``` #### result I used [autorun.sh](https://github.com/kc71486/Computer-Architecture/blob/main/hw2/version/v3/autorun.sh) to automate these process.<br> :::warning The script can only run in bash environment, using shell will cause compile error. ::: :::spoiler autorun script: ```bash= #!/bin/bash function runprogram() { if [ $1 -ne 0 ] then echo "make error, exit" exit fi echo "program size:" > "out${2}" riscv-none-elf-size hammingdistance.elf >> "out${2}" echo "execution result:" >> "out${2}" rv32emu hammingdistance.elf >> "out${2}" riscv-none-elf-objdump -d hammingdistance.elf >> "dump${2}" make clean } optims=( "-O0" "-O1" "-O2" "-O3" "-Os" "-Ofast" ) if [ $# -eq 0 ] then make runprogram $? "-O0" elif [ "$1" = "all" ] then for i in "${optims[@]}" do make OLVL=$i runprogram $? $i done elif [ "$1" = "asm" ] then make OLVL=-O2 runprogram $? "-asm" else make OLVL=$1 runprogram $? $1 fi echo "finished" ``` ::: > program size: > textdatabssdec > 9092142486811384 > execution result: > Elapse cycle:1406 > (result: correct) Worse in size, but much better in speed. ### cleanup (v4) #### standardize function arguement Let `HammingDistance_c` and `HammingDistance_s` have identical function type, so they are easily interchangable. This will increase cycles in c O0 optimization. #### add instruction count Add instruction count in main. Note that cycle measurement is in instruction count measurement, so instruction count might end up more than cycle count. ``` uint32_t startinstr = get_instret(); uint32_t startcycle = get_cycles(); int32_t d1 = HammingDistance_c(t1_x0, t1_y0, t1_x1, t1_y1); int32_t d2 = HammingDistance_c(t2_x0, t2_y0, t2_x1, t2_y1); int32_t d3 = HammingDistance_c(t3_x0, t3_y0, t3_x1, t3_y1); uint32_t endcycle = get_cycles(); uint32_t endinstr = get_instret(); ``` #### assembly bugfix + rigister move In previous version, I used `blt` and `bgt` instead of `bltu` and `bgtu`, this will cause some error in some edge case. Fixing it will cause cycle to increase.<br> After inline count_leading_zero, the register available is enough to totally remove all saved registers. Also, since the function don't call anything, return address no longer need to be saved.<br> #### modify makefile and bash script Split data section and functions in object file using `-fdata-sections -ffunction-sections`, and auto clear unused function using `--gc-sections`. Doing so will decrease the code size but increase the cycle. :::spoiler core part of makefile ``` OLVL := -O0 ABIFLAGS := -march=rv32i_zicsr -mabi=ilp32 GCCFLAGS := -fdata-sections -ffunction-sections CFLAGS := -Wall -Wextra $(OLVL) LDFLAGS := -Wl,--gc-sections %.o: %.s $(CROSS_COMPILE)gcc $(GCCFLAGS) $(ABIFLAGS) -c -o $@ $< %.o: %.c $(CROSS_COMPILE)gcc $(GCCFLAGS) $(ABIFLAGS) $(CFLAGS) -c -o $@ $< $(BIN): $(OBJS) $(CROSS_COMPILE)gcc $(LDFLAGS) -o $@ $^ ``` ::: Add more unimportant things in bash script, nothing special. #### result <b>c implementation O0:</b> > program size: > textdatabssdec > 41243647845272 > execution result: > Elapse cycle:8570 > Instruction count:8588 > (result: correct) More cycle than first version (+1053 cycles) because more features are added, and apparently unoptimized new features have awful performance. Much smaller code size because most unused function are removed.<br> <b>c implementation O1:</b> > program size: > textdatabssdec > 26563127803748 > execution result: > Elapse cycle:2692 > Instruction count:2706 > (result: correct) Slightly more cycle than first version (+130 cycles) because more features are added. Also much smaller code size.<br> <b>asm implementation O0:</b> > textdatabssdec > 27203647843868 > execution result: > Elapse cycle:1608 > Instruction count:1626 > (result: correct) Suprisingly not much more cycle than previous version (+220 cycles). Probably because more feature are added and the bugfix. Also much smaller code size. ## Final Result ### cycles | target | O0 | O1 | O2 | Os | | ------ | ---- | ---- | ---- | ---- | | v1_c | 7517 | 2562 | 2866 | 2798 | | v1_asm | 3198 | 3198 | 3198 | 3198 | | v2_asm | 2566 | 2566 | 2566 | 2566 | | v3_asm | 1406 | 1404 | 1404 | 1404 | | v4_c | 8570 | 2692 | 2693 | 2957 | | v4_asm | 1608 | 1599 | 1599 | 1599 | ### size | target | O0 | O1 | O2 | Os | | ------ | ----- | ----- | ----- | ----- | | v1_c | 56266 | 55106 | 55122 | 55102 | | v1_asm | 56870 | 56870 | 56870 | 56870 | | v2_asm | 11216 | 11216 | 11216 | 11216 | | v3_asm | 11384 | 10088 | 10080 | 9992 | | v4_c | 5272 | 3748 | 3744 | 3656 | | v4_asm | 3868 | 3528 | 3524 | 3440 | appendix: (v4_c, O3): (2693, 4872) ## Almost pure assembly version ### rewrite main function Copy main function from disassembled elf file, and perform these changes: * Manually add data segment. * Change all `gp` referenced variable into `la;lw`. * Change all `jal <address>` into `call <label>`. * Change all `s0` referenced address into `sp` referenced address (useless). ### write another script Write new bash script ([runmains.sh](https://github.com/kc71486/Computer-Architecture/blob/main/hw2/runmains.sh)) that uses `as` and `ld` that compiles mains.s. `gcc` and `as` and `ld` all have unique flags. ### add idiv32 and imod32 My itos function requires integer division and modulo, and since gcc provides these funcion in their standard library. I was able to get away with it. Now I need to copy them and include them into my file. ### add link.ld To remove warning. ### result <img src="https://github.com/kc71486/Computer-Architecture/raw/main/hw2/img/main_s.png" width=100% alt="all assembly"><br> > Elapse cycle:8582^@Instruction count:8596^@Hamming Distance:24^@ ... I don't know why printchar newline doesn't work. This part of main function is identical to c counterpart, and I didn't modify printchar during this step. :::spoiler printchar, shared between main.c and main.s ```=27 # void printchar(char); printchar: la a1, pch # fake string sw a0, 0(a1) # put char into fake string li a0, STDOUT # write to stdout la a2, 1 # length = 1 li a7, SYSWRITE ecall # invoke syscall ret ``` ::: :::spoiler main.s ```=111 li a1,17 la a0,hamstr call printstr lw a0,16(sp) call printint li a0,10 call printchar ``` ::: :::warning You should refine the use of `write` system call. :notes: jserv ::: ## Reference * [perfcounter](https://github.com/sysprog21/rv32emu/tree/master/tests/perfcounter) * [remove unused function](https://stackoverflow.com/questions/6687630/how-to-remove-unused-c-c-symbols-with-gcc-and-ld) * [lab2](https://hackmd.io/@sysprog/rJAufgHYS) * [div.c](https://github.com/gcc-mirror/gcc/blob/master/libgcc/config/lm32/_divsi3.c)