徐銘宏
    • Create new note
    • Create a note from template
      • Sharing URL Link copied
      • /edit
      • View mode
        • Edit mode
        • View mode
        • Book mode
        • Slide mode
        Edit mode View mode Book mode Slide mode
      • Customize slides
      • Note Permission
      • Read
        • Only me
        • Signed-in users
        • Everyone
        Only me Signed-in users Everyone
      • Write
        • Only me
        • Signed-in users
        • Everyone
        Only me Signed-in users Everyone
      • Engagement control Commenting, Suggest edit, Emoji Reply
    • Invite by email
      Invitee

      This note has no invitees

    • Publish Note

      Share your work with the world Congratulations! 🎉 Your note is out in the world Publish Note

      Your note will be visible on your profile and discoverable by anyone.
      Your note is now live.
      This note is visible on your profile and discoverable online.
      Everyone on the web can find and read all notes of this public team.
      See published notes
      Unpublish note
      Please check the box to agree to the Community Guidelines.
      View profile
    • Commenting
      Permission
      Disabled Forbidden Owners Signed-in users Everyone
    • Enable
    • Permission
      • Forbidden
      • Owners
      • Signed-in users
      • Everyone
    • Suggest edit
      Permission
      Disabled Forbidden Owners Signed-in users Everyone
    • Enable
    • Permission
      • Forbidden
      • Owners
      • Signed-in users
    • Emoji Reply
    • Enable
    • Versions and GitHub Sync
    • Note settings
    • Note Insights
    • Engagement control
    • Transfer ownership
    • Delete this note
    • Save as template
    • Insert from template
    • Import from
      • Dropbox
      • Google Drive
      • Gist
      • Clipboard
    • Export to
      • Dropbox
      • Google Drive
      • Gist
    • Download
      • Markdown
      • HTML
      • Raw HTML
Menu Note settings Versions and GitHub Sync Note Insights Sharing URL Create Help
Create Create new note Create a note from template
Menu
Options
Engagement control Transfer ownership Delete this note
Import from
Dropbox Google Drive Gist Clipboard
Export to
Dropbox Google Drive Gist
Download
Markdown HTML Raw HTML
Back
Sharing URL Link copied
/edit
View mode
  • Edit mode
  • View mode
  • Book mode
  • Slide mode
Edit mode View mode Book mode Slide mode
Customize slides
Note Permission
Read
Only me
  • Only me
  • Signed-in users
  • Everyone
Only me Signed-in users Everyone
Write
Only me
  • Only me
  • Signed-in users
  • Everyone
Only me Signed-in users Everyone
Engagement control Commenting, Suggest edit, Emoji Reply
  • Invite by email
    Invitee

    This note has no invitees

  • Publish Note

    Share your work with the world Congratulations! 🎉 Your note is out in the world Publish Note

    Your note will be visible on your profile and discoverable by anyone.
    Your note is now live.
    This note is visible on your profile and discoverable online.
    Everyone on the web can find and read all notes of this public team.
    See published notes
    Unpublish note
    Please check the box to agree to the Community Guidelines.
    View profile
    Engagement control
    Commenting
    Permission
    Disabled Forbidden Owners Signed-in users Everyone
    Enable
    Permission
    • Forbidden
    • Owners
    • Signed-in users
    • Everyone
    Suggest edit
    Permission
    Disabled Forbidden Owners Signed-in users Everyone
    Enable
    Permission
    • Forbidden
    • Owners
    • Signed-in users
    Emoji Reply
    Enable
    Import from Dropbox Google Drive Gist Clipboard
       owned this note    owned this note      
    Published Linked with GitHub
    Subscribed
    • Any changes
      Be notified of any changes
    • Mention me
      Be notified of mention me
    • Unsubscribe
    Subscribe
    # 2017q1 Homework3 (software-pipelining) contributed by < `xdennisx` > ## 論文研讀 閱讀論文:[When Prefetching Works, When It Doesn’t, and Why](http://www.cc.gatech.edu/~hyesoon/lee_taco12.pdf) 論文中前五大點[這裡](https://hackmd.io/s/HJtfT3icx)都有做大部分的解釋 ### Case Studies #### Positive Group - <i>433.milc</i> 下面程式碼是有關一個 3x3 的矩陣運算 ![](https://i.imgur.com/GadFayJ.png) `milc` 大部分的 cache-miss 都是發生在存取小規模的陣列,因為這些 matrix access 都是屬於 **short stream**,所以無法用 HW prefetch 去訓練它,因此他在 function call 的附近做了 SW prefetch 的動作,有了 SW prefetch,速度甚至比最好的 HW prefetch 快了 2.3 倍。 - <i>459.GemsFDTD</i> `GemsFDTD` 大部分的問題出在 **complex indirect memory accesses**,像下面的變數 m 就是一個 indirect index,而 HW prefetch 對這種 irregular index 的加速效能有限,在這邊只增加了 12%(stream prefetcher),然而 SW prefetch 卻能降低超過 50% 的執行時間(與 Baseline 相比) ![](https://i.imgur.com/E45NwuE.png) - <i>429.mcf</i> `mcf` 有密集的記憶體操作,他也使用了 array of pointer 的操作,由於 array of pointer 也是屬於 irregular behavior,所以 HW prefetcher 也不會表現太好。 ![](https://i.imgur.com/LCXHP54.png) 這邊將 `arc` structure 做 prefetch,利用下一個 array index 去計算 next address,這是 HW prefetcher 做不到的。而使用 SW+STR 可以讓原本只有 STR 的效能提升 180% - <i>470.lbm</i> `lbm` 計算兩個 3-D array,這讓 GHB 跟 STR 可以有效預測 address,即使如此,實驗過後還是發現 SW prefetcher 還是可以表現的比較好,大大提升 cache hit ratio。 另外發現 `lbm` 對 prefetch distance 非常敏感,下圖顯示 SW prefetching 可達到將近 100% coverage,因為幾乎所有 `srcGrid` 的元素都會被 prefetch 到 ![](https://i.imgur.com/JO8hKhX.png) 這張圖說明不同的 distance 對 performance 的影響 ![](https://i.imgur.com/EoIzbrx.png) 下圖說明一個 prefetch distance unit 取決於下一個 loop iteration ![](http://i.imgur.com/JAv02Rl.png) - <i>462.libquantum</i> `libquantum` 主要存取的方式都是 **fixed-stride streaming**,所以 SW 跟 HW 都可以預測 `reg->node` 的 address,可是 SW 可以做到 1.23 倍速的提升,因為 HW 有太多的 L1 cache miss penalties。 ![](http://i.imgur.com/KbRgcwp.png) 下圖顯示因為有太多的 L1 cache miss,所以 execution cycles 跟 on-chip L2 traffics 都會增加。GHB 在 SW 的訓練下,因為產生過多沒用的 prefetch,造成效能下降 55%。 ![](http://i.imgur.com/Hi1pyRh.png) - <i>403.gcc</i> `gcc` 屬於 RDS,是非常 irregular,而且他的 access 是 indirect,藉由 insert prefetch 我們得到 14% 的效能提升,下面是程式碼片斷 ![](http://i.imgur.com/vY8BwUe.png) - <i>434.zeusmp</i> `zeusmp` 也是一個 3-D 的計算,每次 iteration 都會存取每個 dimension 的所有鄰近點,`v3` 可以同時被 HW 跟 SW prefetch,SW 降低了 L1 cache misses 而 HW 除去了一些 late prefetches,分別提升了 10%,4% 的效能 ![](http://i.imgur.com/i4P9ail.png) #### Neutral Group - <i>401.bzip2</i> `bzip2` 的核心演算法是[Burrows–Wheeler Transform](https://zh.wikipedia.org/wiki/Burrows-Wheeler%E5%8F%98%E6%8D%A2),當中有很多的 branch,所以 HW prefetchers 沒有什麼用處,SW 在這裡也沒有表現的很好 ![](http://i.imgur.com/035S5NL.png) 下圖顯示當中有很多的 early prefetch,不過 software prefetching 還是能提升 8% 左右的效能 ![](http://i.imgur.com/oYPHNNf.png) - <i>436.cactusADM</i> `cactusADM` 也是計算 3-D 陣列,每次 iteration 都會存取一個維度鄰近的點,下圖在 `alp`、`ADM_kxx_stag_p` 插入 prefetch,雖然這裡的 data strcture 都是很好預測的,GHB 的表現還是比 STR 好很多,因為有太多的 in-flight stream,不過 SW prefetcher 可以縮小他們之間的差距(5%)。GHB 經過 training 之後可以減少 7% 的 late prefetcher ![](http://i.imgur.com/Vl2xV9m.png) - <i>450.soplex</i> `soplex` 中大部份 cahce-miss 發生在 C++ class overloaded operators 跟沒有迴圈的 member function,因為不知道實際的 data structure 長怎樣,所以不知道要在哪邊插入 prefetcher,但是他的 memory access 有一定程度的規律,所以 HW prefetching 就變得相對有用。 ![](http://i.imgur.com/gGBe2FF.png) 上圖顯示雖然 SW perfetching 是有用的,但是他被歸類在 neutral group 是因為 bandwidth consumption 以及太多錯誤的 SW prefetches - <i>410.bwaves</i> #### Negative Group - <i>482.sphinx3</i> `sphinx3` 有許多不同的資料存取方式,像是 RDS, array of pointers, regular strides,不過主要是 unit-stride streams,因此 HW prefetchers 很適合用在這上面。因為 prefetch timeliness 和小陣列的關係,SW prefetchers 不會用到每一個 data structure 上。 - <i>437.leslie3d</i> ![](https://i.imgur.com/vuvjEuX.png) `leslie3d` 有非常規律的 stride,但因為每個 iteration 之間的運算很受限,非得要用大的 prefetch distance,造成過多的 instruction 反而抵消原本預期帶來的利益。 ## 測試環境 ```shell dennis@dennis-X550CC:~/prefetcher$ lscpu Architecture: x86_64 CPU 作業模式: 32-bit, 64-bit Byte Order: Little Endian CPU(s): 4 On-line CPU(s) list: 0-3 每核心執行緒數:2 每通訊端核心數:2 Socket(s): 1 NUMA 節點: 1 供應商識別號: GenuineIntel CPU 家族: 6 型號: 58 Model name: Intel(R) Core(TM) i5-3337U CPU @ 1.80GHz 製程: 9 CPU MHz: 993.585 CPU max MHz: 2700.0000 CPU min MHz: 800.0000 BogoMIPS: 3592.01 虛擬: VT-x L1d 快取: 32K L1i 快取: 32K L2 快取: 256K L3 快取: 3072K NUMA node0 CPU(s): 0-3 ``` ## 最初實驗數據 ### 時間 ``` naive: 250935 us sse: 124485 us sse prefetch: 65676 us ``` ### 用 **perf stat** 看一下cache-miss ``` Performance counter stats for './naive_transpose' (100 runs): 19,145,519 cache-misses # 95.049 % of all cache refs ( +- 0.13% ) 20,142,743 cache-references ( +- 0.04% ) 1,449,345,026 instructions # 1.17 insns per cycle ( +- 0.01% ) 1,238,612,771 cycles ( +- 0.66% ) 0.492766897 seconds time elapsed ( +- 0.73% ) Performance counter stats for './sse_transpose' (100 runs): 6,959,105 cache-misses # 91.276 % of all cache refs ( +- 0.52% ) 7,624,244 cache-references ( +- 0.26% ) 1,237,818,237 instructions # 1.38 insns per cycle ( +- 0.01% ) 899,769,987 cycles ( +- 0.75% ) 0.358673348 seconds time elapsed ( +- 0.81% ) Performance counter stats for './sse_prefetch_transpose' (100 runs): 6,752,137 cache-misses # 89.026 % of all cache refs ( +- 0.22% ) 7,584,432 cache-references ( +- 0.08% ) 1,283,460,940 instructions # 1.75 insns per cycle ( +- 0.01% ) 733,322,586 cycles ( +- 0.29% ) 0.286447261 seconds time elapsed ( +- 0.46% ) ``` 好像用了 prefetch,cache-miss 就會下降的樣子,而且時間也有明顯下降。不過用 perf 所測得的 cache-miss 是**整體**的,而 prefetch 所減少的是 **L1-dcache cache misses** ### L1-cache-test ``` Performance counter stats for './sse_transpose' (100 runs): 8,513,522 L1-dcache-load-misses ( +- 0.04% ) 4,136,589 L1-dcache-store-misses ( +- 0.29% ) 65,288 L1-dcache-prefetch-misses ( +- 0.38% ) 94,785 L1-icache-load-misses ( +- 5.73% ) 0.345807079 seconds time elapsed ( +- 0.60% ) Performance counter stats for './sse_prefetch_transpose' (100 runs): 8,517,476 L1-dcache-load-misses ( +- 0.02% ) 4,175,545 L1-dcache-store-misses ( +- 0.19% ) 64,775 L1-dcache-prefetch-misses ( +- 0.43% ) 82,900 L1-icache-load-misses ( +- 4.33% ) 0.284601743 seconds time elapsed ( +- 0.50% ) ``` ㄜ....所以這是沒有減少的意思嗎... 我嘗試不同的 prefetch distance 看看 ``` D = 4 Performance counter stats for './sse_prefetch_transpose' (100 runs): 8,527,637 L1-dcache-load-misses ( +- 0.03% ) 4,130,140 L1-dcache-store-misses ( +- 0.36% ) 68,948 L1-dcache-prefetch-misses ( +- 0.31% ) 88,588 L1-icache-load-misses ( +- 4.91% ) 0.294001054 seconds time elapsed ( +- 0.58% ) D = 16 Performance counter stats for './sse_prefetch_transpose' (100 runs): 8,552,080 L1-dcache-load-misses ( +- 0.03% ) 4,115,989 L1-dcache-store-misses ( +- 0.41% ) 64,701 L1-dcache-prefetch-misses ( +- 0.42% ) 92,911 L1-icache-load-misses ( +- 5.27% ) 0.304326302 seconds time elapsed ( +- 0.72% ) D = 32 Performance counter stats for './sse_prefetch_transpose' (100 runs): 8,580,343 L1-dcache-load-misses ( +- 0.03% ) 4,080,988 L1-dcache-store-misses ( +- 0.41% ) 65,730 L1-dcache-prefetch-misses ( +- 0.38% ) 112,045 L1-icache-load-misses ( +- 4.02% ) 0.301808613 seconds time elapsed ( +- 0.54% ) D = 64 Performance counter stats for './sse_prefetch_transpose' (100 runs): 8,595,820 L1-dcache-load-misses ( +- 0.03% ) 4,089,623 L1-dcache-store-misses ( +- 0.32% ) 65,383 L1-dcache-prefetch-misses ( +- 0.38% ) 107,591 L1-icache-load-misses ( +- 4.24% ) 0.303709076 seconds time elapsed ( +- 0.52% ) ``` D = 8 的時候是最好的 > 之後該來探討為啥 8 是最好的 ### Raw Counter 在一些系統架構下,有一些 counter 沒有在 `perf list` 裡面,如果是 Intel CPU 的話可以看 [Intel® 64 and IA-32 Architectures Developer’s Manual: Vol. 3B的19章](http://www.intel.com/content/www/us/en/processors/core/core-technical-resources.html),要對應到自己的 CPU 型號查 mask number 及 event number。 我的 CPU 是 Intel Core i5-3337U,是屬於第三代的,對應可用的 Raw Counter 是 **LOAD_HIT_PRE.SW_PF** 跟 **LOAD_HIT_PRE.HW_PF** ![](https://i.imgur.com/gsLCQVL.png) #### 測試(這邊 D = 8) - `perf stat -e r014c ./raw_counter_sse` ``` Performance counter stats for './sse_transpose' (100 runs): 309 r014c ( +- 2.23% ) 0.344172960 seconds time elapsed ( +- 0.48% ) ``` - `perf stat -e r014c ./raw_counter_sse_prefetch` ``` Performance counter stats for './sse_prefetch_transpose' (100 runs): 656,289 r014c ( +- 2.72% ) 0.288038377 seconds time elapsed ( +- 0.47% ) ``` - `perf stat -e r024c ./time_sse` ``` Performance counter stats for './sse_transpose' (100 runs): 599,852 r024c ( +- 0.34% ) 0.347760797 seconds time elapsed ( +- 0.65% ) ``` - `perf stat -e r024c ./time_sse_prefetch` ``` Performance counter stats for './sse_prefetch_transpose' (100 runs): 614,884 r024c ( +- 0.79% ) 0.293914743 seconds time elapsed ( +- 0.82% ) ``` 整體來說,prefetch 還是有增加 cache-hit 的次數 ## SSE/AVX intrinsic 這裡有一個很不一樣的是 `__m256i _mm256_permute2x128_si256 (__m256i a, __m256i b, const int imm8)`,利用控制 `imm8` 可以達到 unpack_ep128 的效果 - Operation ``` SELECT4(src1, src2, control){ CASE(control[1:0]) 0: tmp[127:0] := src1[127:0] 1: tmp[127:0] := src1[255:128] 2: tmp[127:0] := src2[127:0] 3: tmp[127:0] := src2[255:128] ESAC IF control[3] tmp[127:0] := 0 FI RETURN tmp[127:0] } dst[127:0] := SELECT4(a[255:0], b[255:0], imm8[3:0]) dst[255:128] := SELECT4(a[255:0], b[255:0], imm8[7:4]) dst[MAX:256] := 0 ``` > 發現我的電腦根本不能用 AVX2,WTH 該買新電腦了 ## Reference - [吳彥寬學長筆記](https://hackmd.io/s/ryTASBCT) - [kaizsv](https://hackmd.io/IwVmCMGMCYFMGYC0A2e5mICwAYDs5EBDAMwBMAORWU8ATl2N2EhHNiA=?view)

    Import from clipboard

    Paste your markdown or webpage here...

    Advanced permission required

    Your current role can only read. Ask the system administrator to acquire write and comment permission.

    This team is disabled

    Sorry, this team is disabled. You can't edit this note.

    This note is locked

    Sorry, only owner can edit this note.

    Reach the limit

    Sorry, you've reached the max length this note can be.
    Please reduce the content or divide it to more notes, thank you!

    Import from Gist

    Import from Snippet

    or

    Export to Snippet

    Are you sure?

    Do you really want to delete this note?
    All users will lose their connection.

    Create a note from template

    Create a note from template

    Oops...
    This template has been removed or transferred.
    Upgrade
    All
    • All
    • Team
    No template.

    Create a template

    Upgrade

    Delete template

    Do you really want to delete this template?
    Turn this template into a regular note and keep its content, versions, and comments.

    This page need refresh

    You have an incompatible client version.
    Refresh to update.
    New version available!
    See releases notes here
    Refresh to enjoy new features.
    Your user state has changed.
    Refresh to load new user state.

    Sign in

    Forgot password

    or

    By clicking below, you agree to our terms of service.

    Sign in via Facebook Sign in via Twitter Sign in via GitHub Sign in via Dropbox Sign in with Wallet
    Wallet ( )
    Connect another wallet

    New to HackMD? Sign up

    Help

    • English
    • 中文
    • Français
    • Deutsch
    • 日本語
    • Español
    • Català
    • Ελληνικά
    • Português
    • italiano
    • Türkçe
    • Русский
    • Nederlands
    • hrvatski jezik
    • język polski
    • Українська
    • हिन्दी
    • svenska
    • Esperanto
    • dansk

    Documents

    Help & Tutorial

    How to use Book mode

    Slide Example

    API Docs

    Edit in VSCode

    Install browser extension

    Contacts

    Feedback

    Discord

    Send us email

    Resources

    Releases

    Pricing

    Blog

    Policy

    Terms

    Privacy

    Cheatsheet

    Syntax Example Reference
    # Header Header 基本排版
    - Unordered List
    • Unordered List
    1. Ordered List
    1. Ordered List
    - [ ] Todo List
    • Todo List
    > Blockquote
    Blockquote
    **Bold font** Bold font
    *Italics font* Italics font
    ~~Strikethrough~~ Strikethrough
    19^th^ 19th
    H~2~O H2O
    ++Inserted text++ Inserted text
    ==Marked text== Marked text
    [link text](https:// "title") Link
    ![image alt](https:// "title") Image
    `Code` Code 在筆記中貼入程式碼
    ```javascript
    var i = 0;
    ```
    var i = 0;
    :smile: :smile: Emoji list
    {%youtube youtube_id %} Externals
    $L^aT_eX$ LaTeX
    :::info
    This is a alert area.
    :::

    This is a alert area.

    Versions and GitHub Sync
    Get Full History Access

    • Edit version name
    • Delete

    revision author avatar     named on  

    More Less

    Note content is identical to the latest version.
    Compare
      Choose a version
      No search result
      Version not found
    Sign in to link this note to GitHub
    Learn more
    This note is not linked with GitHub
     

    Feedback

    Submission failed, please try again

    Thanks for your support.

    On a scale of 0-10, how likely is it that you would recommend HackMD to your friends, family or business associates?

    Please give us some advice and help us improve HackMD.

     

    Thanks for your feedback

    Remove version name

    Do you want to remove this version name and description?

    Transfer ownership

    Transfer to
      Warning: is a public team. If you transfer note to this team, everyone on the web can find and read this note.

        Link with GitHub

        Please authorize HackMD on GitHub
        • Please sign in to GitHub and install the HackMD app on your GitHub repo.
        • HackMD links with GitHub through a GitHub App. You can choose which repo to install our App.
        Learn more  Sign in to GitHub

        Push the note to GitHub Push to GitHub Pull a file from GitHub

          Authorize again
         

        Choose which file to push to

        Select repo
        Refresh Authorize more repos
        Select branch
        Select file
        Select branch
        Choose version(s) to push
        • Save a new version and push
        • Choose from existing versions
        Include title and tags
        Available push count

        Pull from GitHub

         
        File from GitHub
        File from HackMD

        GitHub Link Settings

        File linked

        Linked by
        File path
        Last synced branch
        Available push count

        Danger Zone

        Unlink
        You will no longer receive notification when GitHub file changes after unlink.

        Syncing

        Push failed

        Push successfully