shhung
    • Create new note
    • Create a note from template
      • Sharing URL Link copied
      • /edit
      • View mode
        • Edit mode
        • View mode
        • Book mode
        • Slide mode
        Edit mode View mode Book mode Slide mode
      • Customize slides
      • Note Permission
      • Read
        • Only me
        • Signed-in users
        • Everyone
        Only me Signed-in users Everyone
      • Write
        • Only me
        • Signed-in users
        • Everyone
        Only me Signed-in users Everyone
      • Engagement control Commenting, Suggest edit, Emoji Reply
    • Invite by email
      Invitee

      This note has no invitees

    • Publish Note

      Share your work with the world Congratulations! 🎉 Your note is out in the world Publish Note

      Your note will be visible on your profile and discoverable by anyone.
      Your note is now live.
      This note is visible on your profile and discoverable online.
      Everyone on the web can find and read all notes of this public team.
      See published notes
      Unpublish note
      Please check the box to agree to the Community Guidelines.
      View profile
    • Commenting
      Permission
      Disabled Forbidden Owners Signed-in users Everyone
    • Enable
    • Permission
      • Forbidden
      • Owners
      • Signed-in users
      • Everyone
    • Suggest edit
      Permission
      Disabled Forbidden Owners Signed-in users Everyone
    • Enable
    • Permission
      • Forbidden
      • Owners
      • Signed-in users
    • Emoji Reply
    • Enable
    • Versions and GitHub Sync
    • Note settings
    • Note Insights New
    • Engagement control
    • Make a copy
    • Transfer ownership
    • Delete this note
    • Save as template
    • Insert from template
    • Import from
      • Dropbox
      • Google Drive
      • Gist
      • Clipboard
    • Export to
      • Dropbox
      • Google Drive
      • Gist
    • Download
      • Markdown
      • HTML
      • Raw HTML
Menu Note settings Note Insights Versions and GitHub Sync Sharing URL Create Help
Create Create new note Create a note from template
Menu
Options
Engagement control Make a copy Transfer ownership Delete this note
Import from
Dropbox Google Drive Gist Clipboard
Export to
Dropbox Google Drive Gist
Download
Markdown HTML Raw HTML
Back
Sharing URL Link copied
/edit
View mode
  • Edit mode
  • View mode
  • Book mode
  • Slide mode
Edit mode View mode Book mode Slide mode
Customize slides
Note Permission
Read
Only me
  • Only me
  • Signed-in users
  • Everyone
Only me Signed-in users Everyone
Write
Only me
  • Only me
  • Signed-in users
  • Everyone
Only me Signed-in users Everyone
Engagement control Commenting, Suggest edit, Emoji Reply
  • Invite by email
    Invitee

    This note has no invitees

  • Publish Note

    Share your work with the world Congratulations! 🎉 Your note is out in the world Publish Note

    Your note will be visible on your profile and discoverable by anyone.
    Your note is now live.
    This note is visible on your profile and discoverable online.
    Everyone on the web can find and read all notes of this public team.
    See published notes
    Unpublish note
    Please check the box to agree to the Community Guidelines.
    View profile
    Engagement control
    Commenting
    Permission
    Disabled Forbidden Owners Signed-in users Everyone
    Enable
    Permission
    • Forbidden
    • Owners
    • Signed-in users
    • Everyone
    Suggest edit
    Permission
    Disabled Forbidden Owners Signed-in users Everyone
    Enable
    Permission
    • Forbidden
    • Owners
    • Signed-in users
    Emoji Reply
    Enable
    Import from Dropbox Google Drive Gist Clipboard
       Owned this note    Owned this note      
    Published Linked with GitHub
    • Any changes
      Be notified of any changes
    • Mention me
      Be notified of mention me
    • Unsubscribe
    # Assignment2: GNU Toolchain contributed by <[`shhung`](https://github.com/shhung/Image-scaling-with-Bilinear-interpolation-by-float32-multiplication)> ## Motivation The work I selected is the [Image scaling with Bilinear interpolation](https://github.com/linyu425/ComputerArchitecture/tree/main/hw1) done by [linyu0425](https://hackmd.io/@linyu0425/SJHkb8lWT). Due to the interesting work, I selected the work for my assignment 2. In addition, I want to review more floating-point processing that has been implemented in the work, such as addition and multiplication. ## Make C program be executed with rv32emu Compiling C program is quite easy by gcc for risc-v. ```shell $ riscv-none-elf-gcc -march=rv32i -mabi=ilp32 -o ImgScale ImgScale.S $ rv32emu ImgScale ``` ## Make assembly program be executed with rv32emu By referencing [asm-hello](https://github.com/sysprog21/rv32emu/tree/master/tests/asm-hello) which is the demo in rv32emu for [System calls](https://github.com/sysprog21/rv32emu/blob/master/docs/syscall.md), we could have a `Makefile` and linker script for ours project. ### Modification All that needs to be done is modifying the syscall part in the original program, as Ripes has its own design for syscalls. In order to focus on the performance of the process logic, I deleted the part that outputs to the console. ### Optimization method Since this work implements multiple sub-functions used for image scaling, I have split it into several parts. Due to this process, we can analyze each unit of image scaling and find out the parts to be optimized. The function structure is shown in the figure below, and the performance of each part will be discussed later. ``` ─── ImgScale ├── count_leading_zeros ├── imul32 ├── fmul32 └── fadd32 ``` ### Bug During testing the parts of the whole program, I found that the functions `imul32` and `fmul32` have a bit of error, and this error occurs in both the C and assembly programs. Additionally, the `imul32` returns a 32-bit signed integer, which could lead to overflow. The same problem occurs in `fmul32` as well. Even so, there is no problem with the overall logic, so optimization will still be done based on the buggy program. ## Execute the program The workable code is available at [src](https://github.com/shhung/Image-scaling-with-Bilinear-interpolation-by-float32-multiplication/tree/main/src), and the Makefile is provided to generate ELF file which is runable on the rv32emu. 1. Buile the ELF file Following the command, you will get `main.elf` and `ImgScaleFromC.elf`. ```shell $ make ``` 2. Run each of it Make sure you have already installed `rv32emu` and set the path for `rv32emu`. If not, check the [Lab2: RISC-V RV32I[MACF] emulator with ELF support](https://hackmd.io/@sysprog/SJAR5XMmi) to setup your environment. ```shell $ rv32emu main.elf ``` ```shell $ rv32emu ImgScaleFromC.elf ``` Expected output: For `main.elf` and `ImgScaleFromC.elf`, the output would only differ at elapsed cycle. ```shell 0.954780 0.877887 0.800995 0.724102 0.647210 0.921899 0.826679 0.731460 0.636240 0.541020 0.889018 0.775471 0.661924 0.548377 0.434830 0.856138 0.724263 0.592389 0.460514 0.328640 0.823257 0.673055 0.522853 0.372652 0.222450 elapsed cycle: 39580 inferior exit code 0 ``` ## Contrast the handwritten and compiler-optimized assembly code By following the example of [ticks.c](https://github.com/sysprog21/rv32emu/blob/master/tests/ticks.c), we can evaluate our work by accumulating cycle count. | | Handwritten | O0 | O1 | O2 | O3 | Os | Ofast | | ----- | ----------- | ----- | ----- | ----- | ----- | ----- | ----- | | Size | 56998 | 83940 | 82384 | 82364 | 83244 | 82364 | 83244 | | Cycle | 39610 | 39065 | 16162 | 16459 | 12183 | 17012 | 12183 | It can be obviously observed that the cycle count has a significant improvement when compiler optimization is applied. However, the size is not affected much. Below, we will describe the difference between the assembly code generated by the compiler and the handwritten assembly code. The complete code is accessible at [shhung](https://github.com/shhung/Image-scaling-with-Bilinear-interpolation-by-float32-multiplication/tree/main/src/disasm). ### count_leading_zeros Since `count_leading_zeros` is already simplified, there is almost nothing to be optimized. The only thing that can be optimized is omitting store/load operations by replacing callee-saved registers with caller-saved registers. ### imul32 By replacing callee-saved registers with caller-saved registers, store/load operations are not required in `imul32`. ### fadd32 Since `count_leading_zeros` is only used once in `fadd32`, `-O2` optimization and further levels replace the function call to `count_leading_zeros` in `fadd32` with directly writing the operation logic in fadd32. ### fmul32 Same as `imul32`, `-O1` optimization generated code uses fewer callee-saved registers. This operation results in fewer load/store instructions. For `-O2` optimization and further levels, the function is branchless, and store/load operations are not required due to replacing callee-saved registers with caller-saved registers. ### Difference Based on the above observation results, it can be found that the purpose and method of optimization are very similar, so we only list the differences between handwritten and compiler-optimized code for `fadd32`. It can be obviously found that there are no store/load operations at the prologue/epilogue of the compiler-optimized code. Additionally, please check line 30 of the compiler-optimized code. Comparing it with the handwritten code, you will find that `count_leading_zeros` is executed here. By embedding it directly into `fadd32`, you can reduce the possible delays caused by function calls. Handwritten: ```c fadd32: addi sp , sp , -52 sw s0 , 0(sp) sw s1 , 4(sp) sw s2 , 8(sp) sw s3 , 12(sp) sw s4 , 16(sp) sw s5 , 20(sp) sw s6 , 24(sp) sw s7 , 28(sp) sw s8 , 32(sp) sw s9 , 36(sp) sw s10 , 40(sp) sw s11 , 44(sp) sw ra , 48(sp) mv s0 , a0 mv s1 , a1 li t0 , 0x7fffffff and t1 , s0 , t0 and t2 , s1 , t0 blt t2 , t1 , noswap mv t0 , s0 mv s0 , s1 mv s1 , t0 noswap: li t0 , 31 srl s2 , s0 , t0 srl s3 , s1 , t0 li t0 , 0x7fffff li t1 , 0x800000 and t2 , s0 , t0 or s4 , t2 , t1 and t2 , s1 , t0 or s5 , t2 , t1 li t0 , 23 li t1 , 0xff srl t2 , s0 , t0 and s6 , t2 , t1 srl t2 , s1 , t0 and s7 , t2 , t1 sub t0 , s6 , s7 li t1 , 24 blt t1 , t0 , setalign_1 mv s8 , t0 j setalign_exit setalign_1: mv s8 , t1 setalign_exit: srl s5 , s5 ,s8 or t0 , s2 , s3 bne t0 , x0 , setma_1 add s4 , s4 , s5 j setma_exit setma_1: sub s4 , s4 , s5 setma_exit: mv a0 , s4 call count_leading_zeros mv s9 , a0 mv s10 , x0 li t0 , 8 blt t0 , s9 , shift_false li t0 , 8 sub s10 , t0 , s9 srl s4 , s4 , s10 add s6 , s6 , s10 j shift_exit shift_false: li t0 , 8 sub s10 , s9 , t0 sll s4 , s4 , s10 sub s6 , s6 , s10 shift_exit: li t0 , 0x80000000 and t0 , s0 , t0 li t1 , 23 sll t1 , s6 , t1 li t2 , 0x7fffff and t2 , s4 , t2 or t0 , t0 , t1 or a0 , t0 , t2 lw s0 , 0(sp) lw s1 , 4(sp) lw s2 , 8(sp) lw s3 , 12(sp) lw s4 , 16(sp) lw s5 , 20(sp) lw s6 , 24(sp) lw s7 , 28(sp) lw s8 , 32(sp) lw s9 , 36(sp) lw s10 , 40(sp) lw s11 , 44(sp) lw ra , 48(sp) addi sp , sp , 52 ret count_leading_zeros: addi sp , sp , -8 sw s0 , 0(sp) sw ra , 4(sp) mv s0 , a0 # s0 = x srli t0 , s0 , 1 or s0 , s0 ,t0 srli t0 , s0 , 2 or s0 , s0 ,t0 srli t0 , s0 , 4 or s0 , s0 ,t0 srli t0 , s0 , 8 or s0 , s0 ,t0 srli t0 , s0 , 16 or s0 , s0 ,t0 srli t0 , s0 , 1 li t1 , 0x55555555 and t0 , t0 , t1 sub s0 , s0 , t0 li t1 , 0x33333333 and t2 , s0 , t1 srli t0 , s0 , 2 li t1 , 0x33333333 and t0 , t0 , t1 add s0 , t0 , t2 srli t0 , s0 , 4 add t0 , t0 , s0 li t1 , 0x0f0f0f0f and s0 , t0 , t1 srli t0 , s0 , 8 add s0 , s0 , t0 srli t0 , s0 , 16 add s0 , s0 , t0 li t0 , 32 andi t1 , s0 , 0x7f sub a0 , t0 , t1 lw s0 , 0(sp) lw ra , 4(sp) addi sp , sp , 8 ret ``` Compiler-optimized: ```c 00010830 <fadd32>: 10830: 800007b7 lui a5,0x80000 10834: fff78793 add a5,a5,-1 # 7fffffff <__BSS_END__+0x7ffdb0db> 10838: 00a7f733 and a4,a5,a0 1083c: 00b7f7b3 and a5,a5,a1 10840: 00050813 mv a6,a0 10844: 00058613 mv a2,a1 10848: 00f74663 blt a4,a5,10854 <fadd32+0x24> 1084c: 00050613 mv a2,a0 10850: 00058813 mv a6,a1 10854: 00800537 lui a0,0x800 10858: 41765693 sra a3,a2,0x17 1085c: 41785793 sra a5,a6,0x17 10860: fff50713 add a4,a0,-1 # 7fffff <__BSS_END__+0x7db0db> 10864: 0ff6f693 zext.b a3,a3 10868: 0ff7f793 zext.b a5,a5 1086c: 00e675b3 and a1,a2,a4 10870: 40f687b3 sub a5,a3,a5 10874: 00e87733 and a4,a6,a4 10878: 01800893 li a7,24 1087c: 00a5e5b3 or a1,a1,a0 10880: 00a76733 or a4,a4,a0 10884: 00f8d463 bge a7,a5,1088c <fadd32+0x5c> 10888: 01800793 li a5,24 1088c: 40f757b3 sra a5,a4,a5 10890: 01066833 or a6,a2,a6 10894: 00f58733 add a4,a1,a5 10898: 00085463 bgez a6,108a0 <fadd32+0x70> 1089c: 40f58733 sub a4,a1,a5 108a0: 00175793 srl a5,a4,0x1 108a4: 00e7e7b3 or a5,a5,a4 108a8: 0027d593 srl a1,a5,0x2 108ac: 00b7e7b3 or a5,a5,a1 108b0: 0047d593 srl a1,a5,0x4 108b4: 00b7e7b3 or a5,a5,a1 108b8: 0087d593 srl a1,a5,0x8 108bc: 00b7e7b3 or a5,a5,a1 108c0: 0107d593 srl a1,a5,0x10 108c4: 00b7e7b3 or a5,a5,a1 108c8: 55555537 lui a0,0x55555 108cc: 0017d593 srl a1,a5,0x1 108d0: 55550513 add a0,a0,1365 # 55555555 <__BSS_END__+0x55530631> 108d4: 00a5f5b3 and a1,a1,a0 108d8: 40b787b3 sub a5,a5,a1 108dc: 33333537 lui a0,0x33333 108e0: 33350513 add a0,a0,819 # 33333333 <__BSS_END__+0x3330e40f> 108e4: 0027d593 srl a1,a5,0x2 108e8: 00a5f5b3 and a1,a1,a0 108ec: 00a7f7b3 and a5,a5,a0 108f0: 00f585b3 add a1,a1,a5 108f4: 0045d793 srl a5,a1,0x4 108f8: 0f0f1537 lui a0,0xf0f1 108fc: 00b787b3 add a5,a5,a1 10900: f0f50513 add a0,a0,-241 # f0f0f0f <__BSS_END__+0xf0cbfeb> 10904: 00a7f7b3 and a5,a5,a0 10908: 0087d593 srl a1,a5,0x8 1090c: 00b787b3 add a5,a5,a1 10910: 0107d593 srl a1,a5,0x10 10914: 00b787b3 add a5,a5,a1 10918: 07f7f793 and a5,a5,127 1091c: 02000593 li a1,32 10920: 40f585b3 sub a1,a1,a5 10924: 00800513 li a0,8 10928: 02b54863 blt a0,a1,10958 <fadd32+0x128> 1092c: fe878793 add a5,a5,-24 10930: 40f75733 sra a4,a4,a5 10934: 00f686b3 add a3,a3,a5 10938: 00971713 sll a4,a4,0x9 1093c: 01769693 sll a3,a3,0x17 10940: 800007b7 lui a5,0x80000 10944: 00975713 srl a4,a4,0x9 10948: 00d76733 or a4,a4,a3 1094c: 00f67533 and a0,a2,a5 10950: 00a76533 or a0,a4,a0 10954: 00008067 ret 10958: 01800593 li a1,24 1095c: 40f587b3 sub a5,a1,a5 10960: 00f71733 sll a4,a4,a5 10964: 40f686b3 sub a3,a3,a5 10968: 00971713 sll a4,a4,0x9 1096c: 01769693 sll a3,a3,0x17 10970: 800007b7 lui a5,0x80000 10974: 00975713 srl a4,a4,0x9 10978: 00d76733 or a4,a4,a3 1097c: 00f67533 and a0,a2,a5 10980: 00a76533 or a0,a4,a0 10984: 00008067 ret ``` ### Cycle count for each part | | count_leading_zeros | imul32 | fadd32 | fmul32 | | ----- | --- | ------ | ------ | ------ | | O0 | 81 | 105 | 191 | 284 | | O1 | 11 | 67 | 106 | 197 | | O2 | 11 | 11 | 107 | 195 | | O3 | 0 | 0 | 103 | 191 | | Os | 11 | 68 | 107 | 209 | | Ofast | 0 | 0 | 103 | 191 | ### Summary Through the process of reviewing optimized assembly code, I found some rules for writing efficient assembly code: 1. Fewer store/load operations are better for callable functions. 2. Rearrange instructions to reduce unnecessary conditional branches. By following these rules when writing assembly code, you can create a relatively clear and efficient program. This will make it more convenient to carry out subsequent optimizations based on this foundation.

    Import from clipboard

    Paste your markdown or webpage here...

    Advanced permission required

    Your current role can only read. Ask the system administrator to acquire write and comment permission.

    This team is disabled

    Sorry, this team is disabled. You can't edit this note.

    This note is locked

    Sorry, only owner can edit this note.

    Reach the limit

    Sorry, you've reached the max length this note can be.
    Please reduce the content or divide it to more notes, thank you!

    Import from Gist

    Import from Snippet

    or

    Export to Snippet

    Are you sure?

    Do you really want to delete this note?
    All users will lose their connection.

    Create a note from template

    Create a note from template

    Oops...
    This template has been removed or transferred.
    Upgrade
    All
    • All
    • Team
    No template.

    Create a template

    Upgrade

    Delete template

    Do you really want to delete this template?
    Turn this template into a regular note and keep its content, versions, and comments.

    This page need refresh

    You have an incompatible client version.
    Refresh to update.
    New version available!
    See releases notes here
    Refresh to enjoy new features.
    Your user state has changed.
    Refresh to load new user state.

    Sign in

    Forgot password

    or

    By clicking below, you agree to our terms of service.

    Sign in via Facebook Sign in via Twitter Sign in via GitHub Sign in via Dropbox Sign in with Wallet
    Wallet ( )
    Connect another wallet

    New to HackMD? Sign up

    Help

    • English
    • 中文
    • Français
    • Deutsch
    • 日本語
    • Español
    • Català
    • Ελληνικά
    • Português
    • italiano
    • Türkçe
    • Русский
    • Nederlands
    • hrvatski jezik
    • język polski
    • Українська
    • हिन्दी
    • svenska
    • Esperanto
    • dansk

    Documents

    Help & Tutorial

    How to use Book mode

    Slide Example

    API Docs

    Edit in VSCode

    Install browser extension

    Contacts

    Feedback

    Discord

    Send us email

    Resources

    Releases

    Pricing

    Blog

    Policy

    Terms

    Privacy

    Cheatsheet

    Syntax Example Reference
    # Header Header 基本排版
    - Unordered List
    • Unordered List
    1. Ordered List
    1. Ordered List
    - [ ] Todo List
    • Todo List
    > Blockquote
    Blockquote
    **Bold font** Bold font
    *Italics font* Italics font
    ~~Strikethrough~~ Strikethrough
    19^th^ 19th
    H~2~O H2O
    ++Inserted text++ Inserted text
    ==Marked text== Marked text
    [link text](https:// "title") Link
    ![image alt](https:// "title") Image
    `Code` Code 在筆記中貼入程式碼
    ```javascript
    var i = 0;
    ```
    var i = 0;
    :smile: :smile: Emoji list
    {%youtube youtube_id %} Externals
    $L^aT_eX$ LaTeX
    :::info
    This is a alert area.
    :::

    This is a alert area.

    Versions and GitHub Sync
    Get Full History Access

    • Edit version name
    • Delete

    revision author avatar     named on  

    More Less

    Note content is identical to the latest version.
    Compare
      Choose a version
      No search result
      Version not found
    Sign in to link this note to GitHub
    Learn more
    This note is not linked with GitHub
     

    Feedback

    Submission failed, please try again

    Thanks for your support.

    On a scale of 0-10, how likely is it that you would recommend HackMD to your friends, family or business associates?

    Please give us some advice and help us improve HackMD.

     

    Thanks for your feedback

    Remove version name

    Do you want to remove this version name and description?

    Transfer ownership

    Transfer to
      Warning: is a public team. If you transfer note to this team, everyone on the web can find and read this note.

        Link with GitHub

        Please authorize HackMD on GitHub
        • Please sign in to GitHub and install the HackMD app on your GitHub repo.
        • HackMD links with GitHub through a GitHub App. You can choose which repo to install our App.
        Learn more  Sign in to GitHub

        Push the note to GitHub Push to GitHub Pull a file from GitHub

          Authorize again
         

        Choose which file to push to

        Select repo
        Refresh Authorize more repos
        Select branch
        Select file
        Select branch
        Choose version(s) to push
        • Save a new version and push
        • Choose from existing versions
        Include title and tags
        Available push count

        Pull from GitHub

         
        File from GitHub
        File from HackMD

        GitHub Link Settings

        File linked

        Linked by
        File path
        Last synced branch
        Available push count

        Danger Zone

        Unlink
        You will no longer receive notification when GitHub file changes after unlink.

        Syncing

        Push failed

        Push successfully