Pieris Kalligeros
# Project 3 Report: Tiled Matrix Multiplication

# Task 1: INT8 extension to INT16

This task exploits the BRAM word width that we prepared in Project 2, where we used only 1 byte for the operands even though we had 4 bytes available (addresses 0 to 4). In Project 2, when we read 32-bit data from BRAM, we used only 8 bits for the operand; in this task, since we increase the precision to INT16, we use 16 bits for the operand, as mentioned in Q[127](https://piazza.com/class/lk4la7genxl4ew/post/127). Since we did not have the Project 2 implementation in Verilog available to modify, we could not finish this task, even though the change itself would be straightforward in Verilog.

# Task 2: Quantization-Dequantization results for different values of `m`

Because different numbers of quantization bits lead to overflow or underflow, the predictions differ (bounding boxes found vs. not found) for different values of `m`. Following the discussion with the TAs in Piazza questions [148](https://piazza.com/class/lk4la7genxl4ew/post/148) and [152](https://piazza.com/class/lk4la7genxl4ew/post/152), values of `m` below 12 do not yield any results.

| Value of `m` | `prediction.jpg` (`dog.jpg`) | `prediction.jpg` (`test.jpeg`) |
| ------------ | ---------------------------- | ------------------------------ |
| 10 | ![m10](https://hackmd.io/_uploads/Bkfh-FxIa.jpg) | ![image](https://hackmd.io/_uploads/H1IcVFgLa.png) |
| 11 | ![m11](https://hackmd.io/_uploads/SJnxGtlUa.jpg) | ![image](https://hackmd.io/_uploads/SkKCVYe8T.png) |
| 12 | ![m12](https://hackmd.io/_uploads/S1OZMtxLa.jpg) | ![image](https://hackmd.io/_uploads/ryrgBYxL6.png) |
| 13 | ![m13](https://hackmd.io/_uploads/HywzMFxUp.jpg) | ![image](https://hackmd.io/_uploads/rJSQBFxIa.png) |
| 14 | ![m14](https://hackmd.io/_uploads/ByeXMYlIp.jpg) | ![image](https://hackmd.io/_uploads/HJbDHtlIp.png) |
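To make the overflow/underflow effect concrete, the sketch below shows a generic fixed-point quantize/dequantize pair with an `m`-bit fractional scale. This is our own illustrative assumption of the scheme (the function names, the INT16 clipping range, and the sample values are not from the project code): with a small `m`, small activations round to zero, while a large `m` pushes large values toward the clipping range.

```python
import numpy as np

def quantize(x, m, n_bits=16):
    # Hypothetical fixed-point quantization: scale by 2^m, round, clip to the signed n_bits range.
    lo, hi = -(2 ** (n_bits - 1)), 2 ** (n_bits - 1) - 1
    return np.clip(np.round(x * (1 << m)), lo, hi).astype(np.int32)

def dequantize(q, m):
    # Map the integer representation back to floating point.
    return q.astype(np.float64) / (1 << m)

x = np.array([0.0003, -0.02, 0.75])   # arbitrary sample activations
for m in (10, 12, 14):
    err = np.abs(dequantize(quantize(x, m), m) - x)
    print(m, err)  # smaller m -> coarser quantization steps, so small values collapse to zero
```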
# Task 3: Tiling

## Algorithm description

| ![algo](https://hackmd.io/_uploads/H1PvP9eIT.jpg "algo") |
|:--:|
| **Figure 1** Tiled matrix multiplication algorithm diagram |

Tiled matrix multiplication can be tricky at first, since vanilla matrix multiplication multiplies a row of the first matrix with a column of the second matrix to produce a single element of the output matrix. The main difference is that a single element of the output matrix is not computed at once; it is gradually built up by adding partial results. The optimization is the key point: we optimize for data sharing and minimize memory fetches. As the diagram above shows, depending on the tile size we populate many partial output results and then sequentially add new partial sums to them. Therefore, once the horizontal traversal of the first matrix and the vertical traversal of the second matrix are done, a whole **tile** of the output has been computed.

## Software tiling implementation (Project Step 3)

To implement software tiling, we define the function `matmul_CPU_tiling(mat1, mat2)`, which takes full-sized matrices (e.g. weight and activation) as inputs and returns `out`, their matmul result computed with tiling. The provided skeleton code allocates `out` with the appropriate shape (MxN), initialized to zeros. This is similar to the *Output Stationary* setup shown in the project document figure. The tile size is *hard-coded* as 8 for Step 3 of the project, while in Step 5 it is provided in the `global_config['tile_size']` variable (tested in `test_step5` with `tile_size=40`).

```python
SW_TILE_SIZE = 8
M, K = mat1.shape
K, N = mat2.shape
out = np.zeros((M, N))
```

The implemented part of our code can be seen below:

```python
for i in range(0, M, SW_TILE_SIZE):
    for j in range(0, N, SW_TILE_SIZE):
        for k in range(0, K, SW_TILE_SIZE):
            # Compute the end indices for the tiles, handling edge cases
            i_end = min(i + SW_TILE_SIZE, M)
            j_end = min(j + SW_TILE_SIZE, N)
            k_end = min(k + SW_TILE_SIZE, K)

            # Extract the tiles
            tile_mat1 = mat1[i:i_end, k:k_end]
            tile_mat2 = mat2[k:k_end, j:j_end]

            # Perform matrix multiplication on the tiles and accumulate the result
            out[i:i_end, j:j_end] += self.matmul(tile_mat1, tile_mat2)
```

### Handling edge cases of `matrix_dimension/tile_size` using end indices

In our implementation, we handle matrix dimensions that are **not evenly divisible** by the tile size by calculating **end indices** for the final tiles in each dimension. As seen in the code above, we use $\min(\text{current iterator value} + \text{tile size},\ \text{dimension})$ for each dimension's end index. In this way, if the last tile in a dimension would extend past the matrix boundary, we take the smaller of the two values so that the tile always *fits* inside the original matrix.

### Extracting tiles

We use three loop iterators: the first two, `i` and `j`, correspond to the *outer* dimensions $(M, N)$ used for the *output*, while the inner iterator `k` corresponds to the *inner shared dimension* of `mat1` and `mat2` (i.e. $K$) used for the tiling. Then, following the [tiling algorithm](#Algorithm-description), `tile_mat1` is taken as the **slice** of `mat1` corresponding to the *row* of the *current tile* (i.e. `[iterator : end index of iterator]`), while `tile_mat2` is similarly the **slice** of `mat2` corresponding to the *column* of the *current tile*. Finally, we **accumulate** the partial sum into the slice of `out` at the $(i, j)$ position (i.e. `[i:i_end, j:j_end]`) for all tiles (inside the inner loop).
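As a quick sanity check of the tiling loop, it can be wrapped into a standalone function and compared against NumPy's reference result. The wrapper below is our own sketch rather than the project skeleton: `np.matmul` stands in for the class's `matmul` method, and the shapes are arbitrary and deliberately not divisible by the tile size to exercise the edge-case handling.

```python
import numpy as np

def matmul_cpu_tiling_ref(mat1, mat2, tile_size=8):
    # Standalone sketch of the tiled loop, using np.matmul for the per-tile product.
    M, K = mat1.shape
    K2, N = mat2.shape
    assert K == K2, "inner dimensions must match"
    out = np.zeros((M, N))
    for i in range(0, M, tile_size):
        for j in range(0, N, tile_size):
            for k in range(0, K, tile_size):
                i_end = min(i + tile_size, M)
                j_end = min(j + tile_size, N)
                k_end = min(k + tile_size, K)
                out[i:i_end, j:j_end] += np.matmul(mat1[i:i_end, k:k_end],
                                                   mat2[k:k_end, j:j_end])
    return out

a, b = np.random.rand(13, 21), np.random.rand(21, 17)
assert np.allclose(matmul_cpu_tiling_ref(a, b), a @ b)
```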
# Task 4: Design of FPGA modules for tiling (Verilog)

An important consideration in the hardware implementation of tiling is the addition of FSM signals. Since the two matrices to be multiplied are already passed in terms of their tiles, the FSM has to dictate when to move to the next tile of `w_bram` and multiply it with the next tile of `a_bram`. A module hierarchically on top of `matmul_system`, called `tiling_controller`, would need to be introduced to manage the movement of tiles. It uses `num_iter_K` to determine the number of tiles and *when to finish the full matmul* (i.e. change the FSM state to end and set SP_BRAM address 100 = 1), `num_iter_row` to fetch the corresponding row from the activation BRAM (A_BRAM), and `num_iter_col` to fetch the corresponding column from the weight BRAM (W_BRAM). Note that the Python code already populates the tiles *flattened* in the BRAMs, which is why these iterators are needed as inputs to the system. For each tile matrix multiplication, the data from A_BRAM and W_BRAM is loaded into the systolic array. The `step_a` and `step_w` signals index the BRAM addresses with the correct step size when buffering into `a_buffer` and `w_buffer`, respectively, for streaming data to the systolic array. The tiles (`num_iter_K`) are **looped over**, and at each iteration the matrix multiplication is computed normally using the *systolic array* (in the appropriate OS/WS mode).

Note that, *similarly to Project 2*, the *input activations* are **sheared** and **pre-loaded** for Weight Stationary mode (WS), while *both* input activations *and* input weights are sheared for Output Stationary mode (OS). These are controlled by the FSM via the SP_BRAM `mode` signal, as in Project 2. The iterators are also used to store the output at the O_BRAM addresses. In Output Stationary mode (OS), the output is accumulated *within the PEs* (without resetting across multiple tiles) with a 1-to-1 mapping of PEs to the output tile, and these values are stored *directly* into the output buffer `o_buffer`. Then, data from the buffer is stored to O_BRAM (flattened), so that the Python code can access the result when the multiplication ends. In Weight Stationary mode, a **new accumulator** module is needed. This module *accumulates* the results (after *unshearing* the output, similarly to Project 2) until *all* outputs for the tiles have been calculated by the WS systolic array. After accumulation is complete, the outputs are passed to `o_buffer` and in turn to O_BRAM, flattened for access from Python. Since the activation matrix is the first matrix in the multiplication, its tiles are traversed horizontally, whereas the weight matrix has to be traversed vertically, as in the [algorithm description](#Algorithm-description). A final note is that correct bit alignment between the way Python provides the tiles to the BRAMs and the way Verilog expects them needs to be ensured, in terms of size and format (MSB:0 or 0:MSB).

# Task 5: Integrating all steps

Performing tiled matrix multiplication with FPGA acceleration is inherently bound by the BRAM size, in our case less than 2048 entries. Limited BRAM size is a known disadvantage of FPGAs. Therefore, calling the hardware implementation of tiling directly on YOLO's matrix multiplications would break. In particular, the following write of the flattened tiles into the BRAM array would exceed the available address range for matrices of YOLO's size:

```python
bram_a_arr[dim*dim*(i+k*num_iter_row) : dim*dim*(i+k*num_iter_row+1)] = serial_tile_mat1
```

Therefore, we need to combine Step 3 with Step 4 and leverage both software and hardware tiling. Hardware tiling is now treated as a normal matrix multiplication and is abstracted away. An important optimization before jumping to the software tiling abstraction level is to exploit the maximum tile size. The maximum tile size is a hard boundary, since the hardware is not dynamically configurable. From `TASK4_python_code` it can be inferred that the underlying systolic array, and hence the maximum tile size, is 8. Leveraging this insight, in the hardware tiling part of Step 5 we pad the matrices if they are smaller than 8x8 and then pass them to the FPGA acceleration. At the higher level of abstraction, i.e. software tiling, the tile size is set by `global_config`, and the call to the hardware implementation of tiling is capped in all matrix dimensions at 40. This means that calls to `partial_fpga = self.matmul_**(tile_mat1,tile_mat2)` always have `tile_mat1` and `tile_mat2` shapes of at most 40 in every dimension, respecting the hard boundary of the Verilog implementation. The results are correctly predicted, as can be seen in the test5 Jupyter notebook.
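To illustrate the padding step described above, the following is a minimal sketch (our own, not the project code) of zero-padding operands smaller than 8x8 before handing them to the hardware matmul and cropping the result afterwards; `matmul_fpga` is a hypothetical stand-in for the FPGA-accelerated call.

```python
import numpy as np

HW_TILE = 8  # systolic array dimension inferred from TASK4_python_code

def pad_to_hw_tile(mat):
    # Zero-pad so both dimensions are at least HW_TILE; larger matrices pass through unchanged.
    rows, cols = mat.shape
    padded = np.zeros((max(rows, HW_TILE), max(cols, HW_TILE)), dtype=mat.dtype)
    padded[:rows, :cols] = mat
    return padded

def matmul_hw_padded(mat1, mat2, matmul_fpga):
    # Pad small operands, call the (hypothetical) FPGA matmul, then crop back to the true shape.
    M, N = mat1.shape[0], mat2.shape[1]
    out = matmul_fpga(pad_to_hw_tile(mat1), pad_to_hw_tile(mat2))
    return out[:M, :N]

# np.matmul stands in for the FPGA call here, just to check that padding preserves the result.
a, b = np.random.rand(5, 7), np.random.rand(7, 3)
assert np.allclose(matmul_hw_padded(a, b, np.matmul), a @ b)
```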
