# 2016q3 Homework3 (software-pipelining)
contributed by <`TotallyWrong`>

## Development Environment

**CPU**: Intel Core i5-5200U
**Cache size**: L1 Cache 128 KB, L2 Cache 512 KB, L3 Cache 3072 KB
**Operating System**: Ubuntu 15.10 Wily Werewolf

---

**Features**:
* MMX instructions
* SSE / Streaming SIMD Extensions
* SSE2 / Streaming SIMD Extensions 2
* SSE3 / Streaming SIMD Extensions 3
* SSSE3 / Supplemental Streaming SIMD Extensions 3
* SSE4 / SSE4.1 + SSE4.2 / Streaming SIMD Extensions 4
* AES / Advanced Encryption Standard instructions
* AVX / Advanced Vector Extensions
* AVX2 / Advanced Vector Extensions 2.0

---

## Background Reading on Prefetching

After reading the recommended "Modern Microprocessors: A 90 Minute Guide!", I got a rough sense of the latency gaps in the memory hierarchy, which shows how much benefit prefetching can potentially bring.

Memory hierarchy of a modern PC (Core i*4), from the guide:

| Level | Size | Latency | Physical Location |
|-------|------|---------|--------------------|
| L1 cache | 32 KB | 4 cycles | inside each core |
| L2 cache | 256 KB | 12 cycles | beside each core |
| L3 cache | 6 MB | ~21 cycles | shared between all cores |
| L4 E-cache | 128 MB | ~58 cycles | separate eDRAM chip |
| RAM | 4+ GB | ~117 cycles | SDRAM DIMMs on motherboard |
| Swap | 100+ GB | 10,000+ cycles | hard disk or SSD |

Reading "When Prefetching Works, When It Doesn't, and Why" made it clear that prefetching splits into hardware (CPU) prefetching and software prefetching, and that badly placed software prefetches can interfere with the hardware prefetcher and hurt overall performance.

The paper highlights several situations where software prefetching pays off:
1. Short streams, where the hardware prefetcher has no time to learn the access pattern.
2. Irregular data structures such as hash tables and recursive structures.
3. Cases where it can reduce L1 cache misses.
4. When too many data streams are processed at once, exceeding the number of streams the hardware prefetcher can track.

It also stresses that the prefetch distance has to be controlled: prefetch too early and the line may be evicted before it is used, wasting the work; too late and the prefetch brings little benefit. A prefetch can also be targeted at the L1, L2, or L3 cache as needed.

![](https://i.imgur.com/3UOzEOT.png)

Before starting on the prefetch assignment I wanted to check the cache configuration of my Intel CPU; Linux exposes this information through sysfs:

`grep . /sys/devices/system/cpu/cpu0/cache/index*/*`

For one core of this Broadwell CPU:
* L1 data cache: 32 KB
    * Cache line: 64 bytes
    * 64 sets
    * 8-way associative
* L2 cache: 256 KB
    * Cache line: 64 bytes
    * 512 sets
    * 8-way associative
* L3 cache: 3072 KB
    * Cache line: 64 bytes
    * 4096 sets
    * 12-way associative

## Analyzing the Prefetch Program

The program itself is not hard to understand: it builds a 4096×4096 integer array and transposes it. The whole array is about 4096×4096×8 bytes, roughly 128 MB, and seeing this made it clear that 4096 is a deliberately chosen size. (A naive version of the transpose is sketched below for reference.)
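For reference, here is a minimal sketch of the straightforward transpose that the benchmark runs as `main_naive`; the function name and signature are assumptions, chosen to match the SSE/AVX versions shown later.

```clike
void naive_transpose(int *src, int *dst, int w, int h)
{
    /* dst[x][y] = src[y][x]: read one column of src, write one row of dst. */
    for (int x = 0; x < w; x++)
        for (int y = 0; y < h; y++)
            *(dst + x * h + y) = *(src + y * w + x);
}
```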
> At first I kept treating the int array as 4-byte elements, and prefetching based on that assumption performed badly. Only later did I remember that C handles arrays through pointers, and a 64-bit pointer is 8 bytes.

With all the arrays aligned, one row of the matrix spans roughly 8 pages; a "block" of the L2 cache can hold one row, and a "block" of the L3 can hold eight rows.

Also, the prefetches already present in this program only bring data into the L2 cache, not L1. The idea seems to be that an L1 miss can then be served from L2 instead of displacing L1 lines directly, probably because L1 is so small that hitting it would require much more precise timing.

While reading the code I wondered why the writes need no prefetch at all. Could it be that Intel CPUs are write-through?

After reading about the [Write buffer](https://en.wikipedia.org/wiki/Write_buffer) I understood that in this program the writes first go from the CPU into the write buffer and are written out only at a suitable time, so they do not need prefetching. I still tried prefetching the write side, and the result was indeed worse:

```
 Performance counter stats for './main_AVX_prefetch' (100 runs):

         9,271,268  cache-misses         #  81.664 % of all cache refs
        11,361,723  cache-references
     1,163,935,141  instructions         #   1.81  insns per cycle
       640,361,365  cycles

       0.242706453 seconds time elapsed  ( +- 0.11% )
```

```
 Performance counter stats for './main_AVX_prefetchV2' (100 runs):

         8,830,566  cache-misses         #  77.818 % of all cache refs
        11,323,298  cache-references
     1,163,997,144  instructions         #   1.77  insns per cycle
       658,548,796  cycles

       0.248680126 seconds time elapsed  ( +- 0.27% )
```
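The counters above, and the per-version comparisons in the next section, come from `perf stat` over 100 repeated runs; the exact invocation is not recorded in this note, but judging from the counters shown it was presumably something along the lines of `perf stat -r 100 -e cache-misses,cache-references,instructions,cycles ./main_AVX_prefetch`.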
## Modifying the Prefetch Program

When converting the SSE version to AVX2 I was not very familiar with the AVX load instructions; after consulting [許士杰](https://embedded2015.hackpad.com/-Homework-7-8-XZe3c94XjUh)'s notes I wrote an AVX version and the first AVX_prefetch version.

```clike=
#include <immintrin.h>  /* AVX2 intrinsics and _mm_prefetch */

void AVX_prefetch_transpose(int *src, int *dst, int w, int h)
{
    for (int x = 0; x < w; x += 8) {
        for (int y = 0; y < h; y += 8) {
            /* Prefetch the 8 source rows of a tile PFDIST rows ahead into L2. */
            _mm_prefetch(src + (y + PFDIST + 0) * w + x, _MM_HINT_T1);
            _mm_prefetch(src + (y + PFDIST + 1) * w + x, _MM_HINT_T1);
            _mm_prefetch(src + (y + PFDIST + 2) * w + x, _MM_HINT_T1);
            _mm_prefetch(src + (y + PFDIST + 3) * w + x, _MM_HINT_T1);
            _mm_prefetch(src + (y + PFDIST + 4) * w + x, _MM_HINT_T1);
            _mm_prefetch(src + (y + PFDIST + 5) * w + x, _MM_HINT_T1);
            _mm_prefetch(src + (y + PFDIST + 6) * w + x, _MM_HINT_T1);
            _mm_prefetch(src + (y + PFDIST + 7) * w + x, _MM_HINT_T1);

            /* Load an 8x8 tile of 32-bit integers. */
            __m256i I0 = _mm256_loadu_si256((__m256i *)(src + (y + 0) * w + x));
            __m256i I1 = _mm256_loadu_si256((__m256i *)(src + (y + 1) * w + x));
            __m256i I2 = _mm256_loadu_si256((__m256i *)(src + (y + 2) * w + x));
            __m256i I3 = _mm256_loadu_si256((__m256i *)(src + (y + 3) * w + x));
            __m256i I4 = _mm256_loadu_si256((__m256i *)(src + (y + 4) * w + x));
            __m256i I5 = _mm256_loadu_si256((__m256i *)(src + (y + 5) * w + x));
            __m256i I6 = _mm256_loadu_si256((__m256i *)(src + (y + 6) * w + x));
            __m256i I7 = _mm256_loadu_si256((__m256i *)(src + (y + 7) * w + x));

            /* Interleave 32-bit then 64-bit elements to transpose the tile. */
            __m256i T0 = _mm256_unpacklo_epi32(I0, I1);
            __m256i T1 = _mm256_unpackhi_epi32(I0, I1);
            __m256i T2 = _mm256_unpacklo_epi32(I2, I3);
            __m256i T3 = _mm256_unpackhi_epi32(I2, I3);
            __m256i T4 = _mm256_unpacklo_epi32(I4, I5);
            __m256i T5 = _mm256_unpackhi_epi32(I4, I5);
            __m256i T6 = _mm256_unpacklo_epi32(I6, I7);
            __m256i T7 = _mm256_unpackhi_epi32(I6, I7);

            I0 = _mm256_unpacklo_epi64(T0, T2);
            I1 = _mm256_unpackhi_epi64(T0, T2);
            I2 = _mm256_unpacklo_epi64(T1, T3);
            I3 = _mm256_unpackhi_epi64(T1, T3);
            I4 = _mm256_unpacklo_epi64(T4, T6);
            I5 = _mm256_unpackhi_epi64(T4, T6);
            I6 = _mm256_unpacklo_epi64(T5, T7);
            I7 = _mm256_unpackhi_epi64(T5, T7);

            /* Store the transposed tile into dst. */
            _mm256_storeu_si256((__m256i *)(dst + ((x + 0) * h) + y), I0);
            _mm256_storeu_si256((__m256i *)(dst + ((x + 1) * h) + y), I1);
            _mm256_storeu_si256((__m256i *)(dst + ((x + 2) * h) + y), I2);
            _mm256_storeu_si256((__m256i *)(dst + ((x + 3) * h) + y), I3);
            _mm256_storeu_si256((__m256i *)(dst + ((x + 4) * h) + y), I4);
            _mm256_storeu_si256((__m256i *)(dst + ((x + 5) * h) + y), I5);
            _mm256_storeu_si256((__m256i *)(dst + ((x + 6) * h) + y), I6);
            _mm256_storeu_si256((__m256i *)(dst + ((x + 7) * h) + y), I7);
        }
    }
}
```
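`PFDIST` here is the prefetch look-ahead distance, measured in rows, and `PFDIST2` (used by the V2 version below) is a larger distance for staging data further out. Their definitions are not shown in this note; they are plain compile-time constants, so something like the following, with hypothetical values:

```clike
/* Hypothetical look-ahead distances: the actual values used in the note are not recorded. */
#define PFDIST  8    /* rows ahead to prefetch into L2 (_MM_HINT_T1) */
#define PFDIST2 64   /* farther look-ahead for the L3 prefetch (_MM_HINT_T2) in V2 */
```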
After modifying the Makefile so that each method builds into its own executable and running each under perf for 100 runs, the results are as follows:

**Naive:**
```
 Performance counter stats for './main_naive' (100 runs):

        23,910,730  cache-misses         #  91.377 % of all cache refs
        26,050,626  cache-references
     1,448,938,705  instructions         #   1.27  insns per cycle
     1,150,593,694  cycles

       0.432103936 seconds time elapsed  ( +- 0.18% )
```

**SSE:**
```
 Performance counter stats for './main_sse' (100 runs):

        11,043,650  cache-misses         #  86.766 % of all cache refs
        12,735,285  cache-references
     1,236,998,307  instructions         #   1.59  insns per cycle
       789,708,333  cycles

       0.297204120 seconds time elapsed  ( +- 0.10% )
```

**SSE Prefetch:**
```
 Performance counter stats for './main_sse_prefetch' (100 runs):

         8,184,734  cache-misses         #  83.109 % of all cache refs
         9,898,505  cache-references
     1,282,988,040  instructions         #   2.01  insns per cycle
       640,139,684  cycles

       0.238750082 seconds time elapsed  ( +- 0.12% )
```

**AVX:**
```
 Performance counter stats for './main_AVX' (100 runs):

         8,791,762  cache-misses         #  85.228 % of all cache refs
        10,313,159  cache-references
     1,140,901,574  instructions         #   1.80  insns per cycle
       637,450,976  cycles

       0.239936268 seconds time elapsed  ( +- 0.11% )
```

**AVX Prefetching:**
```
 Performance counter stats for './main_AVX_prefetch' (100 runs):

         9,196,764  cache-misses         #  81.035 % of all cache refs
        11,320,842  cache-references
     1,163,973,064  instructions         #   1.87  insns per cycle
       619,061,075  cycles

       0.235337862 seconds time elapsed  ( +- 0.11% )
```

By average elapsed time, the ranking from fastest to slowest is AVX Prefetch > SSE Prefetch > AVX > SSE > Naive.

After a few more tests it seemed that, because L2 is 8-way associative, prefetching 8 blocks at once can evict lines that are still in use, which costs some time; spreading the prefetches out over the loop body made the code a bit faster. Also, the whole matrix is 128 MB, which can never fit in the L3 cache, so some prefetches have to go all the way to memory; prefetching from memory into L3 first makes the later prefetch into L2 quicker. With these two changes I arrived at AVX Prefetch V2, which is indeed slightly faster. The code and results follow.

**AVX Prefetch V2 code:**
```clike=
void AVX_prefetchV2_transpose(int *src, int *dst, int w, int h)
{
    for (int x = 0; x < w; x += 8) {
        for (int y = 0; y < h; y += 8) {
            /* Prefetch only the first four rows of the tile PFDIST rows ahead. */
            _mm_prefetch(src + (y + PFDIST + 0) * w + x, _MM_HINT_T1);
            _mm_prefetch(src + (y + PFDIST + 1) * w + x, _MM_HINT_T1);
            _mm_prefetch(src + (y + PFDIST + 2) * w + x, _MM_HINT_T1);
            _mm_prefetch(src + (y + PFDIST + 3) * w + x, _MM_HINT_T1);

            /* Load an 8x8 tile of 32-bit integers. */
            __m256i I0 = _mm256_loadu_si256((__m256i *)(src + (y + 0) * w + x));
            __m256i I1 = _mm256_loadu_si256((__m256i *)(src + (y + 1) * w + x));
            __m256i I2 = _mm256_loadu_si256((__m256i *)(src + (y + 2) * w + x));
            __m256i I3 = _mm256_loadu_si256((__m256i *)(src + (y + 3) * w + x));
            __m256i I4 = _mm256_loadu_si256((__m256i *)(src + (y + 4) * w + x));
            __m256i I5 = _mm256_loadu_si256((__m256i *)(src + (y + 5) * w + x));
            __m256i I6 = _mm256_loadu_si256((__m256i *)(src + (y + 6) * w + x));
            __m256i I7 = _mm256_loadu_si256((__m256i *)(src + (y + 7) * w + x));

            __m256i T0 = _mm256_unpacklo_epi32(I0, I1);
            __m256i T1 = _mm256_unpackhi_epi32(I0, I1);
            __m256i T2 = _mm256_unpacklo_epi32(I2, I3);
            __m256i T3 = _mm256_unpackhi_epi32(I2, I3);
            __m256i T4 = _mm256_unpacklo_epi32(I4, I5);
            __m256i T5 = _mm256_unpackhi_epi32(I4, I5);
            __m256i T6 = _mm256_unpacklo_epi32(I6, I7);
            __m256i T7 = _mm256_unpackhi_epi32(I6, I7);

            /* Prefetch the remaining four rows while the first unpack stage runs. */
            _mm_prefetch(src + (y + PFDIST + 4) * w + x, _MM_HINT_T1);
            _mm_prefetch(src + (y + PFDIST + 5) * w + x, _MM_HINT_T1);
            _mm_prefetch(src + (y + PFDIST + 6) * w + x, _MM_HINT_T1);
            _mm_prefetch(src + (y + PFDIST + 7) * w + x, _MM_HINT_T1);

            I0 = _mm256_unpacklo_epi64(T0, T2);
            I1 = _mm256_unpackhi_epi64(T0, T2);
            I2 = _mm256_unpacklo_epi64(T1, T3);
            I3 = _mm256_unpackhi_epi64(T1, T3);
            I4 = _mm256_unpacklo_epi64(T4, T6);
            I5 = _mm256_unpackhi_epi64(T4, T6);
            I6 = _mm256_unpacklo_epi64(T5, T7);
            I7 = _mm256_unpackhi_epi64(T5, T7);

            _mm256_storeu_si256((__m256i *)(dst + ((x + 0) * h) + y), I0);
            _mm256_storeu_si256((__m256i *)(dst + ((x + 1) * h) + y), I1);
            _mm256_storeu_si256((__m256i *)(dst + ((x + 2) * h) + y), I2);
            _mm256_storeu_si256((__m256i *)(dst + ((x + 3) * h) + y), I3);
            _mm256_storeu_si256((__m256i *)(dst + ((x + 4) * h) + y), I4);
            _mm256_storeu_si256((__m256i *)(dst + ((x + 5) * h) + y), I5);
            _mm256_storeu_si256((__m256i *)(dst + ((x + 6) * h) + y), I6);
            _mm256_storeu_si256((__m256i *)(dst + ((x + 7) * h) + y), I7);

            /* Stage a much more distant row in L3 (_MM_HINT_T2) for a later iteration. */
            _mm_prefetch(src + (y + PFDIST2 + 0) * w + x, _MM_HINT_T2);
        }
    }
}
```

**AVX Prefetch V2 results:**
```
         8,744,585  cache-misses         #  83.729 % of all cache refs
        10,436,599  cache-references
     1,166,862,788  instructions         #   1.91  insns per cycle
       615,119,198  cycles

       0.230133365 seconds time elapsed  ( +- 0.12% )
```

In the end AVX Prefetch V2 (0.230133365 s) is only about 2% faster than AVX Prefetch, much less than I expected. This shows that prefetching still has to be weighed against its cost: if the improvement is smaller than the overhead, the program can actually get slower, and since the CPU already has its own hardware prefetcher, careless software prefetching can interfere with it.

## Conclusion

After many experiments I found that effective use of prefetching is closely tied to the cache line size and the degree of associativity. Fully mastering when and how far ahead to prefetch will take more practice.

###### tags: `TotallyWrong` `On going`
