Try   HackMD

Assignment2: RISC-V Toolchain

Contributed by kc71486

tags: RISC-V, jserv

Install riscv rv32emu toolchain

Follow the instruction from this note

Ignore Configure $PATH part, and add riscv-none-elf-gcc and rv32emu into ~/.bashrc instead.

# add riscv-none-elf-gcc into PATH export PATH="$PATH:~/riscv-none-elf-gcc/bin" # add rv32emu into PATH export PATH="$PATH:~/rv32emu/build"

Then source the .bashrc file

source .bashrc

Pick a Program

The following question is picked from the Assignment 1

唐飴苹 Calculate the Hamming Distance using Counting Leading Zeros
The Hamming Distance between two integers is defined as the number of differing bits at the same position when comparing the binary representations of the integers. For example, the Hamming Distance between 1011101 and 1001001 is 2.
In the assignment, I implement the program to calculate the Hamming Distance between the two given 64-bit unsigned integers.

Source code
The original implementation of the questions is as follows.

int32_t HammingDistance_c(uint64_t x0, uint64_t x1) { int Hdist = 0; int16_t max_digit = 64 - (int16_t)count_leading_zeros((x0 > x1)? x0 : x1); while(max_digit > 0){ uint64_t c1 = x0 & 1; uint64_t c2 = x1 & 1; if(c1 != c2) Hdist += 1; x0 = x0 >> 1; x1 = x1 >> 1; max_digit -= 1; } return Hdist; }

There is an easier way to achieve this, but since it is probably not how we should do the assignment, I am not doing it.

Easier way:

int32_t HammingDistance_c(uint64_t x0, uint64_t x1) {
    int64_t diff = x0 ^ x1;
    return population_count(diff);
}

The reason I choose this question is that hamming code is used in error correcting code, which is widely used in main memory, especially in servers, where reliability is very important.

Compile / excecute the code

To compile the .c file, I wrote makefile.

makefile
.PHONY: clean
include toolchain.mk
ASFLAGS = -march=rv32i_zicsr -mabi=ilp32
CFLAGS = -O0 -Wall -Wextra
OBJS = \
    getcycles.o \
    main.o \
    hammingc_v0.o \
    hammings_v0.o
BIN = hammingdistance.elf
%.o: %.s
	riscv-none-elf-gcc $(ASFLAGS) -c -o $@ $<
%.o: %.c
	riscv-none-elf-gcc $(ASFLAGS) $(CFLAGS) -c -o $@ $<
all: $(BIN)
$(BIN): $(OBJS)
	riscv-none-elf-gcc -o $@ $^

And use make to compile.

There are some modification I need to make:

  • Modify assembly flag (include csr) to get current cycle.
  • Link with riscv-none-elf-gcc (instead of riscv-none-elf-ld) in order to use standard library such as printf function.
  • Change CFLAGS according to the optimization level

For every test, I run the shell that includes:

riscv-none-elf-readelf -h hammingdistance.elf
echo ""
riscv-none-elf-size hammingdistance.elf
echo ""
rv32emu hammingdistance.elf

gcc O0 optimization

dec:56266, cycle:7517, result: (correct)

O0 is the baseline of everything. Nothing special other than large program size.

Disassembled
00010220 <count_leading_zeros>:
   10220:	f7010113          	add	sp,sp,-144
   10224:	08812623          	sw	s0,140(sp)
   10228:	09212423          	sw	s2,136(sp)
   1022c:	09312223          	sw	s3,132(sp)
   10230:	09412023          	sw	s4,128(sp)
   10234:	07512e23          	sw	s5,124(sp)
   10238:	07612c23          	sw	s6,120(sp)
   1023c:	07712a23          	sw	s7,116(sp)
   10240:	07812823          	sw	s8,112(sp)
   10244:	07912623          	sw	s9,108(sp)
   10248:	07a12423          	sw	s10,104(sp)
   1024c:	07b12223          	sw	s11,100(sp)
   10250:	09010413          	add	s0,sp,144
   10254:	fca42423          	sw	a0,-56(s0)
   10258:	fcb42623          	sw	a1,-52(s0)
   1025c:	fcc42783          	lw	a5,-52(s0)
   10260:	01f79713          	sll	a4,a5,0x1f
   10264:	fc842783          	lw	a5,-56(s0)
   10268:	0017d613          	srl	a2,a5,0x1
   1026c:	00c76633          	or	a2,a4,a2
   10270:	fcc42783          	lw	a5,-52(s0)
   10274:	0017d693          	srl	a3,a5,0x1
   10278:	fc842783          	lw	a5,-56(s0)
   1027c:	00c7ee33          	or	t3,a5,a2
   10280:	fcc42783          	lw	a5,-52(s0)
   10284:	00d7eeb3          	or	t4,a5,a3
   10288:	fdc42423          	sw	t3,-56(s0)
   1028c:	fdd42623          	sw	t4,-52(s0)
   10290:	fcc42783          	lw	a5,-52(s0)
   10294:	01e79713          	sll	a4,a5,0x1e
   10298:	fc842783          	lw	a5,-56(s0)
   1029c:	0027d813          	srl	a6,a5,0x2
   102a0:	01076833          	or	a6,a4,a6
   102a4:	fcc42783          	lw	a5,-52(s0)
   102a8:	0027d893          	srl	a7,a5,0x2
   102ac:	fc842783          	lw	a5,-56(s0)
   102b0:	0107e7b3          	or	a5,a5,a6
   102b4:	f8f42823          	sw	a5,-112(s0)
   102b8:	fcc42783          	lw	a5,-52(s0)
   102bc:	0117e7b3          	or	a5,a5,a7
   102c0:	f8f42a23          	sw	a5,-108(s0)
   102c4:	f9042783          	lw	a5,-112(s0)
   102c8:	f9442803          	lw	a6,-108(s0)
   102cc:	fcf42423          	sw	a5,-56(s0)
   102d0:	fd042623          	sw	a6,-52(s0)
   102d4:	fcc42783          	lw	a5,-52(s0)
   102d8:	01c79713          	sll	a4,a5,0x1c
   102dc:	fc842783          	lw	a5,-56(s0)
   102e0:	0047d313          	srl	t1,a5,0x4
   102e4:	00676333          	or	t1,a4,t1
   102e8:	fcc42783          	lw	a5,-52(s0)
   102ec:	0047d393          	srl	t2,a5,0x4
   102f0:	fc842783          	lw	a5,-56(s0)
   102f4:	0067e7b3          	or	a5,a5,t1
   102f8:	f8f42423          	sw	a5,-120(s0)
   102fc:	fcc42783          	lw	a5,-52(s0)
   10300:	0077e7b3          	or	a5,a5,t2
   10304:	f8f42623          	sw	a5,-116(s0)
   10308:	f8842783          	lw	a5,-120(s0)
   1030c:	f8c42803          	lw	a6,-116(s0)
   10310:	fcf42423          	sw	a5,-56(s0)
   10314:	fd042623          	sw	a6,-52(s0)
   10318:	fcc42783          	lw	a5,-52(s0)
   1031c:	01879713          	sll	a4,a5,0x18
   10320:	fc842783          	lw	a5,-56(s0)
   10324:	0087d913          	srl	s2,a5,0x8
   10328:	01276933          	or	s2,a4,s2
   1032c:	fcc42783          	lw	a5,-52(s0)
   10330:	0087d993          	srl	s3,a5,0x8
   10334:	fc842783          	lw	a5,-56(s0)
   10338:	0127e7b3          	or	a5,a5,s2
   1033c:	f8f42023          	sw	a5,-128(s0)
   10340:	fcc42783          	lw	a5,-52(s0)
   10344:	0137e7b3          	or	a5,a5,s3
   10348:	f8f42223          	sw	a5,-124(s0)
   1034c:	f8042783          	lw	a5,-128(s0)
   10350:	f8442803          	lw	a6,-124(s0)
   10354:	fcf42423          	sw	a5,-56(s0)
   10358:	fd042623          	sw	a6,-52(s0)
   1035c:	fcc42783          	lw	a5,-52(s0)
   10360:	01079713          	sll	a4,a5,0x10
   10364:	fc842783          	lw	a5,-56(s0)
   10368:	0107d793          	srl	a5,a5,0x10
   1036c:	fcf42023          	sw	a5,-64(s0)
   10370:	fc042783          	lw	a5,-64(s0)
   10374:	00f767b3          	or	a5,a4,a5
   10378:	fcf42023          	sw	a5,-64(s0)
   1037c:	fcc42783          	lw	a5,-52(s0)
   10380:	0107d793          	srl	a5,a5,0x10
   10384:	fcf42223          	sw	a5,-60(s0)
   10388:	fc842783          	lw	a5,-56(s0)
   1038c:	fc042603          	lw	a2,-64(s0)
   10390:	fc442683          	lw	a3,-60(s0)
   10394:	00060713          	mv	a4,a2
   10398:	00e7e7b3          	or	a5,a5,a4
   1039c:	f6f42c23          	sw	a5,-136(s0)
   103a0:	fcc42783          	lw	a5,-52(s0)
   103a4:	00068713          	mv	a4,a3
   103a8:	00e7e7b3          	or	a5,a5,a4
   103ac:	f6f42e23          	sw	a5,-132(s0)
   103b0:	f7842783          	lw	a5,-136(s0)
   103b4:	f7c42803          	lw	a6,-132(s0)
   103b8:	fcf42423          	sw	a5,-56(s0)
   103bc:	fd042623          	sw	a6,-52(s0)
   103c0:	fcc42783          	lw	a5,-52(s0)
   103c4:	0007d793          	srl	a5,a5,0x0
   103c8:	faf42c23          	sw	a5,-72(s0)
   103cc:	fa042e23          	sw	zero,-68(s0)
   103d0:	fc842783          	lw	a5,-56(s0)
   103d4:	fb842603          	lw	a2,-72(s0)
   103d8:	fbc42683          	lw	a3,-68(s0)
   103dc:	00060713          	mv	a4,a2
   103e0:	00e7e7b3          	or	a5,a5,a4
   103e4:	f6f42823          	sw	a5,-144(s0)
   103e8:	fcc42783          	lw	a5,-52(s0)
   103ec:	00068713          	mv	a4,a3
   103f0:	00e7e7b3          	or	a5,a5,a4
   103f4:	f6f42a23          	sw	a5,-140(s0)
   103f8:	f7042783          	lw	a5,-144(s0)
   103fc:	f7442803          	lw	a6,-140(s0)
   10400:	fcf42423          	sw	a5,-56(s0)
   10404:	fd042623          	sw	a6,-52(s0)
   10408:	fcc42783          	lw	a5,-52(s0)
   1040c:	01f79793          	sll	a5,a5,0x1f
   10410:	fc842703          	lw	a4,-56(s0)
   10414:	00175d13          	srl	s10,a4,0x1
   10418:	01a7ed33          	or	s10,a5,s10
   1041c:	fcc42783          	lw	a5,-52(s0)
   10420:	0017dd93          	srl	s11,a5,0x1
   10424:	555557b7          	lui	a5,0x55555
   10428:	55578793          	add	a5,a5,1365 # 55555555 <__BSS_END__+0x55536691>
   1042c:	00fd77b3          	and	a5,s10,a5
   10430:	faf42823          	sw	a5,-80(s0)
   10434:	555557b7          	lui	a5,0x55555
   10438:	55578793          	add	a5,a5,1365 # 55555555 <__BSS_END__+0x55536691>
   1043c:	00fdf7b3          	and	a5,s11,a5
   10440:	faf42a23          	sw	a5,-76(s0)
   10444:	fc842603          	lw	a2,-56(s0)
   10448:	fcc42683          	lw	a3,-52(s0)
   1044c:	fb042803          	lw	a6,-80(s0)
   10450:	fb442883          	lw	a7,-76(s0)
   10454:	00080593          	mv	a1,a6
   10458:	40b60733          	sub	a4,a2,a1
   1045c:	00070593          	mv	a1,a4
   10460:	00b635b3          	sltu	a1,a2,a1
   10464:	00088513          	mv	a0,a7
   10468:	40a687b3          	sub	a5,a3,a0
   1046c:	40b786b3          	sub	a3,a5,a1
   10470:	00068793          	mv	a5,a3
   10474:	fce42423          	sw	a4,-56(s0)
   10478:	fcf42623          	sw	a5,-52(s0)
   1047c:	fcc42783          	lw	a5,-52(s0)
   10480:	01e79793          	sll	a5,a5,0x1e
   10484:	fc842703          	lw	a4,-56(s0)
   10488:	00275c13          	srl	s8,a4,0x2
   1048c:	0187ec33          	or	s8,a5,s8
   10490:	fcc42783          	lw	a5,-52(s0)
   10494:	0027dc93          	srl	s9,a5,0x2
   10498:	333337b7          	lui	a5,0x33333
   1049c:	33378793          	add	a5,a5,819 # 33333333 <__BSS_END__+0x3331446f>
   104a0:	00fc77b3          	and	a5,s8,a5
   104a4:	faf42423          	sw	a5,-88(s0)
   104a8:	333337b7          	lui	a5,0x33333
   104ac:	33378793          	add	a5,a5,819 # 33333333 <__BSS_END__+0x3331446f>
   104b0:	00fcf7b3          	and	a5,s9,a5
   104b4:	faf42623          	sw	a5,-84(s0)
   104b8:	fc842703          	lw	a4,-56(s0)
   104bc:	333337b7          	lui	a5,0x33333
   104c0:	33378793          	add	a5,a5,819 # 33333333 <__BSS_END__+0x3331446f>
   104c4:	00f777b3          	and	a5,a4,a5
   104c8:	faf42023          	sw	a5,-96(s0)
   104cc:	fcc42703          	lw	a4,-52(s0)
   104d0:	333337b7          	lui	a5,0x33333
   104d4:	33378793          	add	a5,a5,819 # 33333333 <__BSS_END__+0x3331446f>
   104d8:	00f777b3          	and	a5,a4,a5
   104dc:	faf42223          	sw	a5,-92(s0)
   104e0:	fa842503          	lw	a0,-88(s0)
   104e4:	fac42583          	lw	a1,-84(s0)
   104e8:	00050693          	mv	a3,a0
   104ec:	fa042803          	lw	a6,-96(s0)
   104f0:	fa442883          	lw	a7,-92(s0)
   104f4:	00080613          	mv	a2,a6
   104f8:	00c68733          	add	a4,a3,a2
   104fc:	00070693          	mv	a3,a4
   10500:	00050613          	mv	a2,a0
   10504:	00c6b6b3          	sltu	a3,a3,a2
   10508:	00058613          	mv	a2,a1
   1050c:	00088593          	mv	a1,a7
   10510:	00b607b3          	add	a5,a2,a1
   10514:	00f686b3          	add	a3,a3,a5
   10518:	00068793          	mv	a5,a3
   1051c:	fce42423          	sw	a4,-56(s0)
   10520:	fcf42623          	sw	a5,-52(s0)
   10524:	fcc42783          	lw	a5,-52(s0)
   10528:	01c79793          	sll	a5,a5,0x1c
   1052c:	fc842703          	lw	a4,-56(s0)
   10530:	00475f13          	srl	t5,a4,0x4
   10534:	01e7ef33          	or	t5,a5,t5
   10538:	fcc42783          	lw	a5,-52(s0)
   1053c:	0047df93          	srl	t6,a5,0x4
   10540:	fc842603          	lw	a2,-56(s0)
   10544:	fcc42683          	lw	a3,-52(s0)
   10548:	00cf0733          	add	a4,t5,a2
   1054c:	00070593          	mv	a1,a4
   10550:	01e5b5b3          	sltu	a1,a1,t5
   10554:	00df87b3          	add	a5,t6,a3
   10558:	00f586b3          	add	a3,a1,a5
   1055c:	00068793          	mv	a5,a3
   10560:	0f0f16b7          	lui	a3,0xf0f1
   10564:	f0f68693          	add	a3,a3,-241 # f0f0f0f <__BSS_END__+0xf0d204b>
   10568:	00d776b3          	and	a3,a4,a3
   1056c:	fcd42423          	sw	a3,-56(s0)
   10570:	0f0f16b7          	lui	a3,0xf0f1
   10574:	f0f68693          	add	a3,a3,-241 # f0f0f0f <__BSS_END__+0xf0d204b>
   10578:	00d7f7b3          	and	a5,a5,a3
   1057c:	fcf42623          	sw	a5,-52(s0)
   10580:	fcc42783          	lw	a5,-52(s0)
   10584:	01879793          	sll	a5,a5,0x18
   10588:	fc842703          	lw	a4,-56(s0)
   1058c:	00875b13          	srl	s6,a4,0x8
   10590:	0167eb33          	or	s6,a5,s6
   10594:	fcc42783          	lw	a5,-52(s0)
   10598:	0087db93          	srl	s7,a5,0x8
   1059c:	fc842603          	lw	a2,-56(s0)
   105a0:	fcc42683          	lw	a3,-52(s0)
   105a4:	01660733          	add	a4,a2,s6
   105a8:	00070593          	mv	a1,a4
   105ac:	00c5b5b3          	sltu	a1,a1,a2
   105b0:	017687b3          	add	a5,a3,s7
   105b4:	00f586b3          	add	a3,a1,a5
   105b8:	00068793          	mv	a5,a3
   105bc:	fce42423          	sw	a4,-56(s0)
   105c0:	fcf42623          	sw	a5,-52(s0)
   105c4:	fcc42783          	lw	a5,-52(s0)
   105c8:	01079793          	sll	a5,a5,0x10
   105cc:	fc842703          	lw	a4,-56(s0)
   105d0:	01075a13          	srl	s4,a4,0x10
   105d4:	0147ea33          	or	s4,a5,s4
   105d8:	fcc42783          	lw	a5,-52(s0)
   105dc:	0107da93          	srl	s5,a5,0x10
   105e0:	fc842603          	lw	a2,-56(s0)
   105e4:	fcc42683          	lw	a3,-52(s0)
   105e8:	01460733          	add	a4,a2,s4
   105ec:	00070593          	mv	a1,a4
   105f0:	00c5b5b3          	sltu	a1,a1,a2
   105f4:	015687b3          	add	a5,a3,s5
   105f8:	00f586b3          	add	a3,a1,a5
   105fc:	00068793          	mv	a5,a3
   10600:	fce42423          	sw	a4,-56(s0)
   10604:	fcf42623          	sw	a5,-52(s0)
   10608:	fcc42783          	lw	a5,-52(s0)
   1060c:	0007d793          	srl	a5,a5,0x0
   10610:	f8f42c23          	sw	a5,-104(s0)
   10614:	f8042e23          	sw	zero,-100(s0)
   10618:	fc842603          	lw	a2,-56(s0)
   1061c:	fcc42683          	lw	a3,-52(s0)
   10620:	f9842803          	lw	a6,-104(s0)
   10624:	f9c42883          	lw	a7,-100(s0)
   10628:	00080593          	mv	a1,a6
   1062c:	00b60733          	add	a4,a2,a1
   10630:	00070593          	mv	a1,a4
   10634:	00c5b5b3          	sltu	a1,a1,a2
   10638:	00088513          	mv	a0,a7
   1063c:	00a687b3          	add	a5,a3,a0
   10640:	00f586b3          	add	a3,a1,a5
   10644:	00068793          	mv	a5,a3
   10648:	fce42423          	sw	a4,-56(s0)
   1064c:	fcf42623          	sw	a5,-52(s0)
   10650:	fc845783          	lhu	a5,-56(s0)
   10654:	07f7f793          	and	a5,a5,127
   10658:	01079793          	sll	a5,a5,0x10
   1065c:	0107d793          	srl	a5,a5,0x10
   10660:	04000713          	li	a4,64
   10664:	40f707b3          	sub	a5,a4,a5
   10668:	01079793          	sll	a5,a5,0x10
   1066c:	0107d793          	srl	a5,a5,0x10
   10670:	00078513          	mv	a0,a5
   10674:	08c12403          	lw	s0,140(sp)
   10678:	08812903          	lw	s2,136(sp)
   1067c:	08412983          	lw	s3,132(sp)
   10680:	08012a03          	lw	s4,128(sp)
   10684:	07c12a83          	lw	s5,124(sp)
   10688:	07812b03          	lw	s6,120(sp)
   1068c:	07412b83          	lw	s7,116(sp)
   10690:	07012c03          	lw	s8,112(sp)
   10694:	06c12c83          	lw	s9,108(sp)
   10698:	06812d03          	lw	s10,104(sp)
   1069c:	06412d83          	lw	s11,100(sp)
   106a0:	09010113          	add	sp,sp,144
   106a4:	00008067          	ret

000106a8 <HammingDistance_c>:
   106a8:	fa010113          	add	sp,sp,-96
   106ac:	04112e23          	sw	ra,92(sp)
   106b0:	04812c23          	sw	s0,88(sp)
   106b4:	05212a23          	sw	s2,84(sp)
   106b8:	05312823          	sw	s3,80(sp)
   106bc:	05412623          	sw	s4,76(sp)
   106c0:	05512423          	sw	s5,72(sp)
   106c4:	05612223          	sw	s6,68(sp)
   106c8:	05712023          	sw	s7,64(sp)
   106cc:	03812e23          	sw	s8,60(sp)
   106d0:	03912c23          	sw	s9,56(sp)
   106d4:	06010413          	add	s0,sp,96
   106d8:	faa42423          	sw	a0,-88(s0)
   106dc:	fab42623          	sw	a1,-84(s0)
   106e0:	fac42023          	sw	a2,-96(s0)
   106e4:	fad42223          	sw	a3,-92(s0)
   106e8:	fc042623          	sw	zero,-52(s0)
   106ec:	fa842603          	lw	a2,-88(s0)
   106f0:	fac42683          	lw	a3,-84(s0)
   106f4:	fa042703          	lw	a4,-96(s0)
   106f8:	fa442783          	lw	a5,-92(s0)
   106fc:	00068513          	mv	a0,a3
   10700:	00078593          	mv	a1,a5
   10704:	00a5ee63          	bltu	a1,a0,10720 <HammingDistance_c+0x78>
   10708:	00068513          	mv	a0,a3
   1070c:	00078593          	mv	a1,a5
   10710:	00b51c63          	bne	a0,a1,10728 <HammingDistance_c+0x80>
   10714:	00060513          	mv	a0,a2
   10718:	00070593          	mv	a1,a4
   1071c:	00a5f663          	bgeu	a1,a0,10728 <HammingDistance_c+0x80>
   10720:	00060713          	mv	a4,a2
   10724:	00068793          	mv	a5,a3
   10728:	00070513          	mv	a0,a4
   1072c:	00078593          	mv	a1,a5
   10730:	af1ff0ef          	jal	10220 <count_leading_zeros>
   10734:	00050793          	mv	a5,a0
   10738:	00078713          	mv	a4,a5
   1073c:	04000793          	li	a5,64
   10740:	40e787b3          	sub	a5,a5,a4
   10744:	01079793          	sll	a5,a5,0x10
   10748:	0107d793          	srl	a5,a5,0x10
   1074c:	fcf41523          	sh	a5,-54(s0)
   10750:	0b40006f          	j	10804 <HammingDistance_c+0x15c>
   10754:	fa842783          	lw	a5,-88(s0)
   10758:	0017fb13          	and	s6,a5,1
   1075c:	fac42783          	lw	a5,-84(s0)
   10760:	0007fb93          	and	s7,a5,0
   10764:	fd642023          	sw	s6,-64(s0)
   10768:	fd742223          	sw	s7,-60(s0)
   1076c:	fa042783          	lw	a5,-96(s0)
   10770:	0017fc13          	and	s8,a5,1
   10774:	fa442783          	lw	a5,-92(s0)
   10778:	0007fc93          	and	s9,a5,0
   1077c:	fb842c23          	sw	s8,-72(s0)
   10780:	fb942e23          	sw	s9,-68(s0)
   10784:	fc042703          	lw	a4,-64(s0)
   10788:	fb842783          	lw	a5,-72(s0)
   1078c:	00f71863          	bne	a4,a5,1079c <HammingDistance_c+0xf4>
   10790:	fc442703          	lw	a4,-60(s0)
   10794:	fbc42783          	lw	a5,-68(s0)
   10798:	00f70863          	beq	a4,a5,107a8 <HammingDistance_c+0x100>
   1079c:	fcc42783          	lw	a5,-52(s0)
   107a0:	00178793          	add	a5,a5,1
   107a4:	fcf42623          	sw	a5,-52(s0)
   107a8:	fac42783          	lw	a5,-84(s0)
   107ac:	01f79793          	sll	a5,a5,0x1f
   107b0:	fa842703          	lw	a4,-88(s0)
   107b4:	00175913          	srl	s2,a4,0x1
   107b8:	0127e933          	or	s2,a5,s2
   107bc:	fac42783          	lw	a5,-84(s0)
   107c0:	0017d993          	srl	s3,a5,0x1
   107c4:	fb242423          	sw	s2,-88(s0)
   107c8:	fb342623          	sw	s3,-84(s0)
   107cc:	fa442783          	lw	a5,-92(s0)
   107d0:	01f79793          	sll	a5,a5,0x1f
   107d4:	fa042703          	lw	a4,-96(s0)
   107d8:	00175a13          	srl	s4,a4,0x1
   107dc:	0147ea33          	or	s4,a5,s4
   107e0:	fa442783          	lw	a5,-92(s0)
   107e4:	0017da93          	srl	s5,a5,0x1
   107e8:	fb442023          	sw	s4,-96(s0)
   107ec:	fb542223          	sw	s5,-92(s0)
   107f0:	fca45783          	lhu	a5,-54(s0)
   107f4:	fff78793          	add	a5,a5,-1
   107f8:	01079793          	sll	a5,a5,0x10
   107fc:	0107d793          	srl	a5,a5,0x10
   10800:	fcf41523          	sh	a5,-54(s0)
   10804:	fca41783          	lh	a5,-54(s0)
   10808:	f4f046e3          	bgtz	a5,10754 <HammingDistance_c+0xac>
   1080c:	fcc42783          	lw	a5,-52(s0)
   10810:	00078513          	mv	a0,a5
   10814:	05c12083          	lw	ra,92(sp)
   10818:	05812403          	lw	s0,88(sp)
   1081c:	05412903          	lw	s2,84(sp)
   10820:	05012983          	lw	s3,80(sp)
   10824:	04c12a03          	lw	s4,76(sp)
   10828:	04812a83          	lw	s5,72(sp)
   1082c:	04412b03          	lw	s6,68(sp)
   10830:	04012b83          	lw	s7,64(sp)
   10834:	03c12c03          	lw	s8,60(sp)
   10838:	03812c83          	lw	s9,56(sp)
   1083c:	06010113          	add	sp,sp,96
   10840:	00008067          	ret

A lot of load/store, no wonder it is so slow.

gcc O1 optimization

dec:55106, cycle:2562, result: (correct)

Much faster than O0. Also smaller code size.

Disassembled
0001021c <HammingDistance_c>:
   1021c:	00050713          	mv	a4,a0
   10220:	00060813          	mv	a6,a2
   10224:	00068893          	mv	a7,a3
   10228:	00b6e663          	bltu	a3,a1,10234 <HammingDistance_c+0x18>
   1022c:	00d59863          	bne	a1,a3,1023c <HammingDistance_c+0x20>
   10230:	00a67663          	bgeu	a2,a0,1023c <HammingDistance_c+0x20>
   10234:	00070813          	mv	a6,a4
   10238:	00058893          	mv	a7,a1
   1023c:	01f89793          	sll	a5,a7,0x1f
   10240:	00185513          	srl	a0,a6,0x1
   10244:	00a7e533          	or	a0,a5,a0
   10248:	0018d793          	srl	a5,a7,0x1
   1024c:	01056833          	or	a6,a0,a6
   10250:	0117e7b3          	or	a5,a5,a7
   10254:	01e79893          	sll	a7,a5,0x1e
   10258:	00285513          	srl	a0,a6,0x2
   1025c:	00a8e533          	or	a0,a7,a0
   10260:	0027d893          	srl	a7,a5,0x2
   10264:	00a86833          	or	a6,a6,a0
   10268:	0117e7b3          	or	a5,a5,a7
   1026c:	01c79893          	sll	a7,a5,0x1c
   10270:	00485513          	srl	a0,a6,0x4
   10274:	00a8e533          	or	a0,a7,a0
   10278:	0047d893          	srl	a7,a5,0x4
   1027c:	00a86833          	or	a6,a6,a0
   10280:	0117e7b3          	or	a5,a5,a7
   10284:	01879893          	sll	a7,a5,0x18
   10288:	00885513          	srl	a0,a6,0x8
   1028c:	00a8e533          	or	a0,a7,a0
   10290:	0087d893          	srl	a7,a5,0x8
   10294:	00a86833          	or	a6,a6,a0
   10298:	0117e7b3          	or	a5,a5,a7
   1029c:	01079893          	sll	a7,a5,0x10
   102a0:	01085513          	srl	a0,a6,0x10
   102a4:	00a8e533          	or	a0,a7,a0
   102a8:	0107d893          	srl	a7,a5,0x10
   102ac:	00a86833          	or	a6,a6,a0
   102b0:	0117e7b3          	or	a5,a5,a7
   102b4:	00f86833          	or	a6,a6,a5
   102b8:	01f79893          	sll	a7,a5,0x1f
   102bc:	00185513          	srl	a0,a6,0x1
   102c0:	00a8e533          	or	a0,a7,a0
   102c4:	0017d313          	srl	t1,a5,0x1
   102c8:	555558b7          	lui	a7,0x55555
   102cc:	55588893          	add	a7,a7,1365 # 55555555 <__BSS_END__+0x555377d1>
   102d0:	01157533          	and	a0,a0,a7
   102d4:	011378b3          	and	a7,t1,a7
   102d8:	40a80533          	sub	a0,a6,a0
   102dc:	00a83833          	sltu	a6,a6,a0
   102e0:	411787b3          	sub	a5,a5,a7
   102e4:	410787b3          	sub	a5,a5,a6
   102e8:	01e79893          	sll	a7,a5,0x1e
   102ec:	00255813          	srl	a6,a0,0x2
   102f0:	0108e833          	or	a6,a7,a6
   102f4:	0027d313          	srl	t1,a5,0x2
   102f8:	333338b7          	lui	a7,0x33333
   102fc:	33388893          	add	a7,a7,819 # 33333333 <__BSS_END__+0x333155af>
   10300:	01187833          	and	a6,a6,a7
   10304:	01137333          	and	t1,t1,a7
   10308:	01157533          	and	a0,a0,a7
   1030c:	0117f7b3          	and	a5,a5,a7
   10310:	00a80533          	add	a0,a6,a0
   10314:	01053833          	sltu	a6,a0,a6
   10318:	00f307b3          	add	a5,t1,a5
   1031c:	00f80833          	add	a6,a6,a5
   10320:	01c81893          	sll	a7,a6,0x1c
   10324:	00455793          	srl	a5,a0,0x4
   10328:	00f8e7b3          	or	a5,a7,a5
   1032c:	00485893          	srl	a7,a6,0x4
   10330:	00a78533          	add	a0,a5,a0
   10334:	00f537b3          	sltu	a5,a0,a5
   10338:	01088833          	add	a6,a7,a6
   1033c:	010787b3          	add	a5,a5,a6
   10340:	0f0f1837          	lui	a6,0xf0f1
   10344:	f0f80813          	add	a6,a6,-241 # f0f0f0f <__BSS_END__+0xf0d318b>
   10348:	01057533          	and	a0,a0,a6
   1034c:	0107f7b3          	and	a5,a5,a6
   10350:	01879893          	sll	a7,a5,0x18
   10354:	00855813          	srl	a6,a0,0x8
   10358:	0108e833          	or	a6,a7,a6
   1035c:	0087d893          	srl	a7,a5,0x8
   10360:	01050833          	add	a6,a0,a6
   10364:	00a83533          	sltu	a0,a6,a0
   10368:	011787b3          	add	a5,a5,a7
   1036c:	00f50533          	add	a0,a0,a5
   10370:	01051893          	sll	a7,a0,0x10
   10374:	01085793          	srl	a5,a6,0x10
   10378:	00f8e7b3          	or	a5,a7,a5
   1037c:	01055893          	srl	a7,a0,0x10
   10380:	00f807b3          	add	a5,a6,a5
   10384:	0107b833          	sltu	a6,a5,a6
   10388:	01150533          	add	a0,a0,a7
   1038c:	00a80833          	add	a6,a6,a0
   10390:	010787b3          	add	a5,a5,a6
   10394:	07f7f793          	and	a5,a5,127
   10398:	04f05463          	blez	a5,103e0 <HammingDistance_c+0x1c4>
   1039c:	00000513          	li	a0,0
   103a0:	00c74833          	xor	a6,a4,a2
   103a4:	00187813          	and	a6,a6,1
   103a8:	01050533          	add	a0,a0,a6
   103ac:	01f59813          	sll	a6,a1,0x1f
   103b0:	00175713          	srl	a4,a4,0x1
   103b4:	00e86733          	or	a4,a6,a4
   103b8:	0015d593          	srl	a1,a1,0x1
   103bc:	01f69813          	sll	a6,a3,0x1f
   103c0:	00165613          	srl	a2,a2,0x1
   103c4:	00c86633          	or	a2,a6,a2
   103c8:	0016d693          	srl	a3,a3,0x1
   103cc:	fff78793          	add	a5,a5,-1
   103d0:	01079793          	sll	a5,a5,0x10
   103d4:	4107d793          	sra	a5,a5,0x10
   103d8:	fc0794e3          	bnez	a5,103a0 <HammingDistance_c+0x184>
   103dc:	00008067          	ret
   103e0:	00000513          	li	a0,0
   103e4:	00008067          	ret

Almost all load/store has become register. count_leading_zero was inlined. A lot of shift and or, probably because 64 bit shift is tedious.

gcc O2 optimization

dec:51666, cycle:2866, result: (correct)

Suprisingly slower than O1. And bigger code size. Really counter intuitive.

gcc -Os optimization

dec:55102, cycle:2798, result: (correct)

Suprisingly faster than O2. And slightly smaller code size.

gcc -Ofast optimization

dec:55122, cycle:2866, result: (correct)

Same to O2, I feel like some O2 optimization flags make things worse.

gcc -O3 optimization

dec:55122, cycle:2866, result: (correct)

Same to O2, nothing special.

original assembly

dec:53414, cycle:3198, result: (incorrect)

Slower than O2 but faster than O0, bigger code size, and the result is wrong.

Modified version:
test1_x0 = 0x0013000000100000;
test1_x1 = 0x00000000000FFFFF;
test2_x0 = 0x0000000200000001;
test2_x1 = 0x7FFFFFFFFFFFFFFE;
test3_x0 = 0x000000028370228F;
test3_x1 = 0x000000028370228F;
This testcase is more general (slight more random instead of consecutive bit difference). The assembly would pass the original testcase.

Optimize the program

Fix the assembly mismatch (v2)

Upon inspecting the assembly, I found that the workflow of the assembly is quite different from the corresponding c code, so The first thing I had to do was match the c code.

max_digit issue

There are several places that didn't follow the c code, such as:

int32_t max_digit_x0 = 64 - count_leading_zeros(x0); int32_t max_digit_x1 = 64 - count_leading_zeros(x1); int32_t max_digit = (max_digit_x0 > max_digit_x1) ? max_digit_x0 : max_digit_x1;

instead of original:

int16_t max_digit = 64 - (int16_t)count_leading_zeros((x0 > x1)? x0 : x1);

It doesn't lead to incorrect result, but additional count_leading_zeros function call does make assembly slower.

c1 c2 issue

And also c1 and c2 assignment is wrong, the assembly can be interpreted as

uint32_t c1, c2; if(max_digit > 32) { c1 = (x0 >> 32) & 1; c2 = (x1 >> 32) & 1; } else { c1 = x0 & 1; c2 = x1 & 1; }

instead of intended:

uint64_t c1 = x0 & 1; uint64_t c2 = x1 & 1; if(c1 != c2) Hdist += 1;

after fixing

dec:56826, cycle:2566, result: (correct)

Slightly slower than -O1 optimization, indicating that the code is almost optimized. Code size is larger than O1, but that can be improved.

Rewrite print function (v2)

printf is bulky, and we don't actually need that much code size to perform general purpose print function. Therefore, I rewrited print function to improve so the code size will be much smaller.

It shouldn't affect cycle since cycle measurement doesn't include any print function.
I replace

#include <stdio.h> //... printf("Hamming Distance:%ld\n", d1); // 24

with

extern void printstr(const char *, uint32_t);
extern void printchar(char);
extern void printint(int32_t);
//...
printstr("Hamming Distance:", 18);
printint(d1);
printchar('\n');

which results in

dec:11216, cycle:2566, result: (correct)

Unsuprisingly, much smaller code size. Cycle count didn't change because cycle measurement didn't include this part.

assembly optimization (v3)

counting_leading_zero optimization

When dealing with int64 right shift, we need to perform three shift and one or operation, which is a lot. But since leading zero is either in higher word or lower word, we only need to do it once.

Depend on the higher bit, there are two situation:

  • Higher bit is zero, in this case only the lower word is important, therefore the answer is normal int32 count_leading_zero + 32.
  • Higher bit is not zero, in this case only the higher word is important, therefore the answer is normal int32 count_leading_zero.
static int32_t count_leading_zero(uint32_t x, uint32_t y) { if(y != 0) { x = y; y = 32; } x |= (x >> 1); x |= (x >> 2); x |= (x >> 4); x |= (x >> 8); x |= (x >> 16); x |= (x >> 32); /* count ones (population count) */ x -= ((x >> 1) & 0x55555555); x = ((x >> 2) & 0x33333333) + (x & 0x33333333); x = ((x >> 4) + x) & 0x0f0f0f0f; x += (x >> 8); x += (x >> 16); return (64 - y - (x & 0x7f)); }

while loop optimization

Another place where 64 bit shift can be simplified.
Because the purpose of this loop is to traverse all digits, we can loop them seperately without two part interacting with each other.

//int32_t HammingDistancev2_c(uint32_t x0, uint32_t y0, uint32_t x1, uint32_t y1); while(max_digit > 32) { Hdist += (y0 ^ y1) & 1; y0 = y0 >> 1; y1 = y1 >> 1; max_digit -= 1; } while(max_digit > 0) { Hdist += (x0 ^ x1) & 1; x0 = x0 >> 1; x1 = x1 >> 1; max_digit -= 1; }

result

I used autorun.sh to automate these process.

The script can only run in bash environment, using shell will cause compile error.

autorun script:
#!/bin/bash function runprogram() { if [ $1 -ne 0 ] then echo "make error, exit" exit fi echo "program size:" > "out${2}" riscv-none-elf-size hammingdistance.elf >> "out${2}" echo "execution result:" >> "out${2}" rv32emu hammingdistance.elf >> "out${2}" riscv-none-elf-objdump -d hammingdistance.elf >> "dump${2}" make clean } optims=( "-O0" "-O1" "-O2" "-O3" "-Os" "-Ofast" ) if [ $# -eq 0 ] then make runprogram $? "-O0" elif [ "$1" = "all" ] then for i in "${optims[@]}" do make OLVL=$i runprogram $? $i done elif [ "$1" = "asm" ] then make OLVL=-O2 runprogram $? "-asm" else make OLVL=$1 runprogram $? $1 fi echo "finished"

program size:
textdatabssdec
9092142486811384
execution result:
Elapse cycle:1406
(result: correct)

Worse in size, but much better in speed.

cleanup (v4)

standardize function arguement

Let HammingDistance_c and HammingDistance_s have identical function type, so they are easily interchangable. This will increase cycles in c O0 optimization.

add instruction count

Add instruction count in main. Note that cycle measurement is in instruction count measurement, so instruction count might end up more than cycle count.

uint32_t startinstr = get_instret();
uint32_t startcycle = get_cycles();
int32_t d1 = HammingDistance_c(t1_x0, t1_y0, t1_x1, t1_y1);
int32_t d2 = HammingDistance_c(t2_x0, t2_y0, t2_x1, t2_y1);
int32_t d3 = HammingDistance_c(t3_x0, t3_y0, t3_x1, t3_y1);
uint32_t endcycle = get_cycles();
uint32_t endinstr = get_instret();

assembly bugfix + rigister move

In previous version, I used blt and bgt instead of bltu and bgtu, this will cause some error in some edge case. Fixing it will cause cycle to increase.

After inline count_leading_zero, the register available is enough to totally remove all saved registers. Also, since the function don't call anything, return address no longer need to be saved.

modify makefile and bash script

Split data section and functions in object file using -fdata-sections -ffunction-sections, and auto clear unused function using --gc-sections. Doing so will decrease the code size but increase the cycle.

core part of makefile
OLVL := -O0
ABIFLAGS := -march=rv32i_zicsr -mabi=ilp32
GCCFLAGS := -fdata-sections -ffunction-sections
CFLAGS := -Wall -Wextra $(OLVL)
LDFLAGS := -Wl,--gc-sections

%.o: %.s
	$(CROSS_COMPILE)gcc $(GCCFLAGS) $(ABIFLAGS) -c -o $@ $<

%.o: %.c
	$(CROSS_COMPILE)gcc $(GCCFLAGS) $(ABIFLAGS) $(CFLAGS) -c -o $@ $<

$(BIN): $(OBJS)
	$(CROSS_COMPILE)gcc $(LDFLAGS) -o $@ $^

Add more unimportant things in bash script, nothing special.

result

c implementation O0:

program size:
textdatabssdec
41243647845272
execution result:
Elapse cycle:8570
Instruction count:8588
(result: correct)

More cycle than first version (+1053 cycles) because more features are added, and apparently unoptimized new features have awful performance. Much smaller code size because most unused function are removed.

c implementation O1:

program size:
textdatabssdec
26563127803748
execution result:
Elapse cycle:2692
Instruction count:2706
(result: correct)

Slightly more cycle than first version (+130 cycles) because more features are added. Also much smaller code size.

asm implementation O0:

textdatabssdec
27203647843868
execution result:
Elapse cycle:1608
Instruction count:1626
(result: correct)

Suprisingly not much more cycle than previous version (+220 cycles). Probably because more feature are added and the bugfix. Also much smaller code size.

Final Result

cycles

target O0 O1 O2 Os
v1_c 7517 2562 2866 2798
v1_asm 3198 3198 3198 3198
v2_asm 2566 2566 2566 2566
v3_asm 1406 1404 1404 1404
v4_c 8570 2692 2693 2957
v4_asm 1608 1599 1599 1599

size

target O0 O1 O2 Os
v1_c 56266 55106 55122 55102
v1_asm 56870 56870 56870 56870
v2_asm 11216 11216 11216 11216
v3_asm 11384 10088 10080 9992
v4_c 5272 3748 3744 3656
v4_asm 3868 3528 3524 3440

appendix: (v4_c, O3): (2693, 4872)

Almost pure assembly version

rewrite main function

Copy main function from disassembled elf file, and perform these changes:

  • Manually add data segment.
  • Change all gp referenced variable into la;lw.
  • Change all jal <address> into call <label>.
  • Change all s0 referenced address into sp referenced address (useless).

write another script

Write new bash script (runmains.sh) that uses as and ld that compiles mains.s. gcc and as and ld all have unique flags.

add idiv32 and imod32

My itos function requires integer division and modulo, and since gcc provides these funcion in their standard library. I was able to get away with it. Now I need to copy them and include them into my file.

add link.ld

To remove warning.

result

Image Not Showing Possible Reasons
  • The image file may be corrupted
  • The server hosting the image is unavailable
  • The image path is incorrect
  • The image format is not supported
Learn More →

Elapse cycle:8582^@Instruction count:8596^@Hamming Distance:24^@

I don't know why printchar newline doesn't work. This part of main function is identical to c counterpart, and I didn't modify printchar during this step.

printchar, shared between main.c and main.s
# void printchar(char); printchar: la a1, pch # fake string sw a0, 0(a1) # put char into fake string li a0, STDOUT # write to stdout la a2, 1 # length = 1 li a7, SYSWRITE ecall # invoke syscall ret
main.s
li a1,17 la a0,hamstr call printstr lw a0,16(sp) call printint li a0,10 call printchar

You should refine the use of write system call.

Image Not Showing Possible Reasons
  • The image file may be corrupted
  • The server hosting the image is unavailable
  • The image path is incorrect
  • The image format is not supported
Learn More →
jserv

Reference