劉亮谷
    • Create new note
    • Create a note from template
      • Sharing URL Link copied
      • /edit
      • View mode
        • Edit mode
        • View mode
        • Book mode
        • Slide mode
        Edit mode View mode Book mode Slide mode
      • Customize slides
      • Note Permission
      • Read
        • Only me
        • Signed-in users
        • Everyone
        Only me Signed-in users Everyone
      • Write
        • Only me
        • Signed-in users
        • Everyone
        Only me Signed-in users Everyone
      • Engagement control Commenting, Suggest edit, Emoji Reply
    • Invite by email
      Invitee

      This note has no invitees

    • Publish Note

      Share your work with the world Congratulations! 🎉 Your note is out in the world Publish Note

      Your note will be visible on your profile and discoverable by anyone.
      Your note is now live.
      This note is visible on your profile and discoverable online.
      Everyone on the web can find and read all notes of this public team.
      See published notes
      Unpublish note
      Please check the box to agree to the Community Guidelines.
      View profile
    • Commenting
      Permission
      Disabled Forbidden Owners Signed-in users Everyone
    • Enable
    • Permission
      • Forbidden
      • Owners
      • Signed-in users
      • Everyone
    • Suggest edit
      Permission
      Disabled Forbidden Owners Signed-in users Everyone
    • Enable
    • Permission
      • Forbidden
      • Owners
      • Signed-in users
    • Emoji Reply
    • Enable
    • Versions and GitHub Sync
    • Note settings
    • Note Insights
    • Engagement control
    • Transfer ownership
    • Delete this note
    • Save as template
    • Insert from template
    • Import from
      • Dropbox
      • Google Drive
      • Gist
      • Clipboard
    • Export to
      • Dropbox
      • Google Drive
      • Gist
    • Download
      • Markdown
      • HTML
      • Raw HTML
Menu Note settings Versions and GitHub Sync Note Insights Sharing URL Create Help
Create Create new note Create a note from template
Menu
Options
Engagement control Transfer ownership Delete this note
Import from
Dropbox Google Drive Gist Clipboard
Export to
Dropbox Google Drive Gist
Download
Markdown HTML Raw HTML
Back
Sharing URL Link copied
/edit
View mode
  • Edit mode
  • View mode
  • Book mode
  • Slide mode
Edit mode View mode Book mode Slide mode
Customize slides
Note Permission
Read
Only me
  • Only me
  • Signed-in users
  • Everyone
Only me Signed-in users Everyone
Write
Only me
  • Only me
  • Signed-in users
  • Everyone
Only me Signed-in users Everyone
Engagement control Commenting, Suggest edit, Emoji Reply
  • Invite by email
    Invitee

    This note has no invitees

  • Publish Note

    Share your work with the world Congratulations! 🎉 Your note is out in the world Publish Note

    Your note will be visible on your profile and discoverable by anyone.
    Your note is now live.
    This note is visible on your profile and discoverable online.
    Everyone on the web can find and read all notes of this public team.
    See published notes
    Unpublish note
    Please check the box to agree to the Community Guidelines.
    View profile
    Engagement control
    Commenting
    Permission
    Disabled Forbidden Owners Signed-in users Everyone
    Enable
    Permission
    • Forbidden
    • Owners
    • Signed-in users
    • Everyone
    Suggest edit
    Permission
    Disabled Forbidden Owners Signed-in users Everyone
    Enable
    Permission
    • Forbidden
    • Owners
    • Signed-in users
    Emoji Reply
    Enable
    Import from Dropbox Google Drive Gist Clipboard
       owned this note    owned this note      
    Published Linked with GitHub
    Subscribed
    • Any changes
      Be notified of any changes
    • Mention me
      Be notified of mention me
    • Unsubscribe
    Subscribe
    # 2016q3 Homework3 (software-pipelining) contributed by <`ktvexe`> - [ ]學習計算機結構並且透過實驗來驗證所學 - [ ]理解 prefetch 對 cache 的影響,從而設計實驗來釐清相關議題 - [ ]論文閱讀和思考 ## 作業要求 * 閱讀 [在計算機裡頭實踐演算法](/s/HyKtIPN0) 提到的論文: "[When Prefetching Works, When It Doesn’t, and Why](http://www.cc.gatech.edu/~hyesoon/lee_taco12.pdf)" (務必事先閱讀論文,否則貿然看程式碼,只是陷入一片茫然!),在 Linux/x86_64 (注意,要用 64-bit 系統,不能透過虛擬機器執行) 上編譯並執行 [prefetcher](https://github.com/embedded2015/prefetcher) * 說明 `naive_transpose`, `sse_transpose`, `sse_prefetch_transpose` 之間的效能差異,以及 prefetcher 對 cache 的影響 * 在 github 上 fork [prefetcher](https://github.com/embedded2015/prefetcher),嘗試用 AVX 進一步提昇效能 * 修改 `Makefile`,產生新的執行檔,分別對應於 `naive_transpose`, `sse_transpose`, `sse_prefetch_transpose` (學習 [phonebook](s/S1RVdgza) 的做法) * 用 perf 分析 cache miss/hit * 學習 `perf stat` 的 raw counter 命令 * 參考 [Performance of SSE and AVX Instruction Sets](http://arxiv.org/pdf/1211.0820.pdf),用 SSE/AVX intrinsic 來改寫程式碼 * 詳細描述實驗設計,以及你的觀察 ## 複習計組 >紀錄:翻閱了算盤書,想回憶書中對prefetch的描述,赫然發現書中充滿了AVX、SIMD等關鍵字,我過去一直以為我是在phonebook中認識這些技術,我的大二到底怎麼了...[name=劉亮谷][color=#c63eef] ## When Prefetching Works, When It Doesn’t, and Why ### Introduction 雖然有很多種software 或是 hardware prefetching的機制可以應付cache miss的可能,但在編譯器中,只有實作一些比較簡單的software prefetching演算法,所以說:注重效能的工程師會使用 prefetching intrinsics function來撰寫程式。 但這樣做也是有問題存在,第1點是這種方式並沒有嚴謹的規則可以描述何種插入 intrinsics funciton的方法最能真正提高效能;再者,software 與 hardware prefetching間互動的複雜度我們是不清楚的。 ## Performance of SSE and AVX Instruction Sets 這份文件簡介中很清楚的介紹SIMD與SSE/AVX instruction,以下節錄: SSE (streaming SIMD extensions)和 AVX (advanced vector extensions) 都是 SIMD (single instruction multiple data streams)的 instruction sets SIMD可讓多核的CPU進行平行處理,基本的資料搬移與算術運算指令可以被同時處理;雖然像GNU與Intel compiler皆俱備 SIMD optimization的選項,但使用適當的SIMD programming技巧,如: data packing, data reuse and asynchronous data等,可以達到更加的效能。 ### Introduction SIMD不同於SISD,他一個指令可以同時對多個data進行操作,提供了CPU層級的平行處理,直接舉例就是:同時對數個數值做加法,或是更快完成向量積。 要實作SIMD文中提到三種作法,分別是:inline assembly、 intrinsic function、 vector class * inline assembly:這種方法當然最複雜,但是優點就是可以清楚掌握每個過程的進行(好比手刻的compiler跟lex/yacc)。 * intrinsic function: 這提供了C/C++語言的介面,可以直接產生assembly code,但缺點是不保證會有最高度的最佳化。 * Vector class:使用vector class的優點是比intrinsic function更容易撰寫,但效能也會比較差。 SSE提供16支XMM registers (xmm0 ∼ xmm15)分別各有128bits AVX將原本SSE的128bits暫存器擴增至256bits,稱為 YMM registers(ymm0 ∼ ymm15) 不過AVX其實已經進展到AVX-512(@Champ Yenvm學長的投影片中提及),很可惜我的電腦不支援,這次無法實驗。 ![](https://i.imgur.com/ljewRAl.png) 此外AVX提供三組input arguments,而SSE只有兩組,所以說SSE只支援a=a+b的操作,而AVX可以辦到a=b+c的操作。 ### Optimization Scheme * Data Packing 這段落很有趣,看到程式碼:他分別一次取 256bits的資料作copy,可以看到操作次數上有所差異,然而最有趣的地方是,如果對C code version開最佳化,他的效能是與SIMD相同的,可以用objdump將最佳化結果倒出來,會發先開最佳化後,會自動將此操作轉為SIMD;而且比較AVX與SSE會發現,效能上並沒有明顯的差異,由此可見YMM registers並沒有辦法提升資料搬移的速度。 C code ```clike= for(i=0 ; i< ArraySize ; i+=8){ b[i] = a[i]; b[i+1] = a[i+1]; b[i+2] = a[i+2]; b[i+3] = a[i+3]; b[i+4] = a[i+4]; b[i+5] = a[i+5]; b[i+6] = a[i+6]; b[i+7] = a[i+7]; } ``` > 我的天阿,他不給我syntax highlight[name=劉亮谷] > 原來小於後面需要空白,不然HackMD會parsing error[name=劉亮谷][color=#c63eef] SSE: ```clike= for(i=0; i< ArraySize; i+=8} { asm volatile ( "movups %2, %%xmm0 \n\t" "movups %%xmm0, %0 \n\t" "movups %3, %%xmm1 \n\t" "movups %%xmm1, %1 \n\t" : "=m" (b[i]), "=m" (b[i+4]) : "m" (a[i]), "m" (a[i+4]) ); } ``` AVX: ```clike= for(i=0; i< ArraySize; i+=8} { asm volatile ( "vmovups %1, %%ymm0 \n\t" "vmovups %%ymm0, %0 \n\t" : "=m" (b[i]) : "m" (a[i]) ); } ``` 效能比較: | |C++|SSE4.2|AVX1| |---|---|----|----| |no opt.|163|94.5|97.7| |max opt.|75|71|75| --- * Data Reuse 要講reuse還是用範例講解最易懂:因為SSE與AVX不需重複將資料搬進reg中,而是重複使用相同register,所以再效能上會比沒有使用SIMD的版本有明顯的提升。 C code ```clike= for(i=0; i < LoopNum; i++) { c[0] = a[0] + b[0]; c[1] = a[1] + b[1]; c[2] = a[2] + b[2]; c[3] = a[3] + b[3]; c[4] = a[4] + b[4]; c[5] = a[5] + b[5]; c[6] = a[6] + b[6]; c[7] = a[7] + b[7]; } ``` SSE ```clike= asm volatile ( ... "LOOP: \n\t" "addps %%xmm0, %%xmm1 \n\t" "addps %%xmm2, %%xmm3 \n\t" "dec %%ecx \n\t" "jnz LOOP \n\t ... ``` AVX ```clike= asm volatile ( ... "LOOP: \n\t" "vaddps %%ymm0, %%ymm1, %%ymm2 \n\t" "dec %%ecx \n\t" "jnz LOOP \n\t" ... ``` 效能比較: | |C++|SSE4.2|AVX1| |---|---|----|----| |no opt.|732|83.7|26.9| |max opt.|317|78.7|26.9| >其實我看不懂AVX在這邊比SSE還要快的原因[name=劉亮谷] --- * Asynchronous Data Transfer 將資料從記憶體搬到register是非常緩慢的動作,所以說以data packing的方式,是很難真正得到效能改善,不過Asynchronous Data Transfer是種同時處理運算與資料搬移的技術,可以將資料搬移的負載減少。 SSE與AVX都提供prefetching的方法,prefetching就是在CPU真正運算前,將資料先辦進cache,不過要注意的是,SSE與AVX皆無法強制data做prefetching,只能提供資訊給CPU做選擇,原因是有可能有更高優先權的process(有可能來自OS或其他)需要被處理。 效能比較: | |C++|SSE4.2|AVX1| |---|---|----|----| |no prefetching|75|71|75| |512 bytes prefetching||69.9|70.9| 這邊的結論是:我們需要更powerful的 asynchronous data transfer method >這邊看到蠻傻眼的,前面描述好像很厲害,結果卻無法有明顯的效能提升[name=劉亮谷][color=#c63eef] ## Experiment Evaluation ### Origin: ```shell= ktvexe@ktvexe-SVS13126PWR:~/sysprog21/prefetcher$ ./main 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 0 4 8 12 1 5 9 13 2 6 10 14 3 7 11 15 sse prefetch: 88613 us sse: 156130 us naive: 296468 us ``` >~~在perfetch下速度有所提升,不過經過前面的資料閱讀,我猜測這應該是再資料量與運算量少時的假象。~~[name=劉亮谷] ### cache miss: * naive ```shell Performance counter stats for './naive' (100 runs): 18,929,950 cache-misses # 93.514 % of all cache refs (66.28%) 20,315,136 cache-references (66.48%) 21,216,594 L1-dcache-load-misses (66.79%) 4,127,598 L1-dcache-store-misses (67.16%) 53,629 L1-dcache-prefetch-misses (67.13%) 66,024 L1-icache-load-misses (66.54%) 0.469926622 seconds time elapsed ( +- 0.60% ) ``` * SSE ```shell Performance counter stats for './SSE' (100 runs): 6,892,848 cache-misses # 90.614 % of all cache refs (66.29%) 7,713,047 cache-references (66.55%) 8,645,086 L1-dcache-load-misses (66.87%) 3,996,170 L1-dcache-store-misses (67.20%) 84,852 L1-dcache-prefetch-misses (67.09%) 166,184 L1-icache-load-misses (66.50%) 0.333443107 seconds time elapsed ( +- 0.80% ) ``` * SSE_PF ```shell Performance counter stats for './SSE_PF' (100 runs): 6,681,287 cache-misses # 87.202 % of all cache refs (65.92%) 7,349,026 cache-references (66.51%) 8,312,963 L1-dcache-load-misses (67.10%) 3,965,986 L1-dcache-store-misses (67.56%) 34,294 L1-dcache-prefetch-misses (67.32%) 63,104 L1-icache-load-misses (66.23%) 0.275961985 seconds time elapsed ( +- 0.97% ) ``` * 研究AVX 依樣畫葫蘆,對照SSE的模式, |Type |Meaning| |-------|-------| |__m256 |256-bit as eight single-precision floating-point values, representing a YMM register or memory location| |__m256d|256-bit as four double-precision floating-point values, representing a YMM register or memory location| |__m256i|256-bit as integers, (bytes, words, etc.)| |__m128|128-bit single precision floating-point (32 bits each)| |__m128d|128-bit double precision floating-point (64 bits each)| SSE中 `__m128i _mm_lddqu_si128 (__m128i const* mem_addr)` Description: Load 128-bits of integer data from unaligned memory into dst. This intrinsic may perform better than _mm_loadu_si128 when the data crosses a cache line boundary. `__m128i _mm_loadu_si128 (__m128i const* mem_addr) ` Description: Load 128-bits of integer data from memory into dst. mem_addr does not need to be aligned on any particular boundary. 相對應AVX中 `__m256i _mm256_lddqu_si256 (__m256i const * mem_addr)` Description: Load 256-bits of integer data from unaligned memory into dst. This intrinsic may perform better than _mm256_loadu_si256 when the data crosses a cache line boundary. `__m256i _mm256_loadu_si256 (__m256i const * mem_addr)` Description Load 256-bits of integer data from memory into dst. mem_addr does not need to be aligned on any particular boundary. >其實不是很清楚`lddqu`與`loadu`的差異,想增加一個`lddqu`的版本卻一直編不過。[name=ktvexe] ```shell cc -msse3 -msse2 --std gnu99 -O0 -Wall -g -DSSE_PF -o SSE_PF main.c In file included from main.c:17:0: impl.c: In function ‘sse_transpose2’: impl.c:39:26: warning: implicit declaration of function ‘_mm_lddqu_si128’ [-Wimplicit-function-declaration] __m128i I0 = _mm_lddqu_si128((__m128i *)(src + (y + 0) * w + x)); ``` >最後依照Intel Intrinsics Guide 上的Synopsis,加上`#include "pmmintrin.h"`終於編過了, 不過最終結果並沒有 ## Reference * [How to Vectorize Code Using C/C++ Classes on 32-Bit Intel® Architecture](https://software.intel.com/en-us/articles/how-to-vectorize-code-using-cc-classes-on-32-bit-intel-architecture) * [Programming trivia: 4x4 integer matrix transpose in SSE2](https://www.randombit.net/bitbashing/2009/10/08/integer_matrix_transpose_in_sse2.html) * [Introduction to Intel® Advanced Vector Extensions](https://software.intel.com/sites/default/files/m/d/4/1/d/8/Intro_to_Intel_AVX.pdf) * [Intel® Architecture nstruction Set Extensions Programming eference](https://software.intel.com/sites/default/files/managed/07/b7/319433-023.pdf) * [Intel Intrinsics Guide](https://software.intel.com/sites/landingpage/IntrinsicsGuide) * [SIMD Programming Introduction](https://docs.google.com/presentation/d/1LeVe7EAmZvqD3KN7p4Wynbd36xPOk9biBCFrdmsqzKs/edit#slide=id.p3) * [Hotball's HIVE](https://www.csie.ntu.edu.tw/~r89004/hive/sse/page_1.html) 未讀 * [Lattice gauge theory](https://en.wikipedia.org/wiki/Lattice_gauge_theory) * [許士杰共筆](https://embedded2015.hackpad.com/-Homework-7-8-XZe3c94XjUh) * [周曠宇共筆](https://embedded2015.hackpad.com/Week8--VGN4PI1cUxh)

    Import from clipboard

    Paste your markdown or webpage here...

    Advanced permission required

    Your current role can only read. Ask the system administrator to acquire write and comment permission.

    This team is disabled

    Sorry, this team is disabled. You can't edit this note.

    This note is locked

    Sorry, only owner can edit this note.

    Reach the limit

    Sorry, you've reached the max length this note can be.
    Please reduce the content or divide it to more notes, thank you!

    Import from Gist

    Import from Snippet

    or

    Export to Snippet

    Are you sure?

    Do you really want to delete this note?
    All users will lose their connection.

    Create a note from template

    Create a note from template

    Oops...
    This template has been removed or transferred.
    Upgrade
    All
    • All
    • Team
    No template.

    Create a template

    Upgrade

    Delete template

    Do you really want to delete this template?
    Turn this template into a regular note and keep its content, versions, and comments.

    This page need refresh

    You have an incompatible client version.
    Refresh to update.
    New version available!
    See releases notes here
    Refresh to enjoy new features.
    Your user state has changed.
    Refresh to load new user state.

    Sign in

    Forgot password

    or

    By clicking below, you agree to our terms of service.

    Sign in via Facebook Sign in via Twitter Sign in via GitHub Sign in via Dropbox Sign in with Wallet
    Wallet ( )
    Connect another wallet

    New to HackMD? Sign up

    Help

    • English
    • 中文
    • Français
    • Deutsch
    • 日本語
    • Español
    • Català
    • Ελληνικά
    • Português
    • italiano
    • Türkçe
    • Русский
    • Nederlands
    • hrvatski jezik
    • język polski
    • Українська
    • हिन्दी
    • svenska
    • Esperanto
    • dansk

    Documents

    Help & Tutorial

    How to use Book mode

    Slide Example

    API Docs

    Edit in VSCode

    Install browser extension

    Contacts

    Feedback

    Discord

    Send us email

    Resources

    Releases

    Pricing

    Blog

    Policy

    Terms

    Privacy

    Cheatsheet

    Syntax Example Reference
    # Header Header 基本排版
    - Unordered List
    • Unordered List
    1. Ordered List
    1. Ordered List
    - [ ] Todo List
    • Todo List
    > Blockquote
    Blockquote
    **Bold font** Bold font
    *Italics font* Italics font
    ~~Strikethrough~~ Strikethrough
    19^th^ 19th
    H~2~O H2O
    ++Inserted text++ Inserted text
    ==Marked text== Marked text
    [link text](https:// "title") Link
    ![image alt](https:// "title") Image
    `Code` Code 在筆記中貼入程式碼
    ```javascript
    var i = 0;
    ```
    var i = 0;
    :smile: :smile: Emoji list
    {%youtube youtube_id %} Externals
    $L^aT_eX$ LaTeX
    :::info
    This is a alert area.
    :::

    This is a alert area.

    Versions and GitHub Sync
    Get Full History Access

    • Edit version name
    • Delete

    revision author avatar     named on  

    More Less

    Note content is identical to the latest version.
    Compare
      Choose a version
      No search result
      Version not found
    Sign in to link this note to GitHub
    Learn more
    This note is not linked with GitHub
     

    Feedback

    Submission failed, please try again

    Thanks for your support.

    On a scale of 0-10, how likely is it that you would recommend HackMD to your friends, family or business associates?

    Please give us some advice and help us improve HackMD.

     

    Thanks for your feedback

    Remove version name

    Do you want to remove this version name and description?

    Transfer ownership

    Transfer to
      Warning: is a public team. If you transfer note to this team, everyone on the web can find and read this note.

        Link with GitHub

        Please authorize HackMD on GitHub
        • Please sign in to GitHub and install the HackMD app on your GitHub repo.
        • HackMD links with GitHub through a GitHub App. You can choose which repo to install our App.
        Learn more  Sign in to GitHub

        Push the note to GitHub Push to GitHub Pull a file from GitHub

          Authorize again
         

        Choose which file to push to

        Select repo
        Refresh Authorize more repos
        Select branch
        Select file
        Select branch
        Choose version(s) to push
        • Save a new version and push
        • Choose from existing versions
        Include title and tags
        Available push count

        Pull from GitHub

         
        File from GitHub
        File from HackMD

        GitHub Link Settings

        File linked

        Linked by
        File path
        Last synced branch
        Available push count

        Danger Zone

        Unlink
        You will no longer receive notification when GitHub file changes after unlink.

        Syncing

        Push failed

        Push successfully