魏孜昀
    • Create new note
    • Create a note from template
      • Sharing URL Link copied
      • /edit
      • View mode
        • Edit mode
        • View mode
        • Book mode
        • Slide mode
        Edit mode View mode Book mode Slide mode
      • Customize slides
      • Note Permission
      • Read
        • Only me
        • Signed-in users
        • Everyone
        Only me Signed-in users Everyone
      • Write
        • Only me
        • Signed-in users
        • Everyone
        Only me Signed-in users Everyone
      • Engagement control Commenting, Suggest edit, Emoji Reply
    • Invite by email
      Invitee

      This note has no invitees

    • Publish Note

      Share your work with the world Congratulations! 🎉 Your note is out in the world Publish Note

      Your note will be visible on your profile and discoverable by anyone.
      Your note is now live.
      This note is visible on your profile and discoverable online.
      Everyone on the web can find and read all notes of this public team.
      See published notes
      Unpublish note
      Please check the box to agree to the Community Guidelines.
      View profile
    • Commenting
      Permission
      Disabled Forbidden Owners Signed-in users Everyone
    • Enable
    • Permission
      • Forbidden
      • Owners
      • Signed-in users
      • Everyone
    • Suggest edit
      Permission
      Disabled Forbidden Owners Signed-in users Everyone
    • Enable
    • Permission
      • Forbidden
      • Owners
      • Signed-in users
    • Emoji Reply
    • Enable
    • Versions and GitHub Sync
    • Note settings
    • Note Insights
    • Engagement control
    • Transfer ownership
    • Delete this note
    • Save as template
    • Insert from template
    • Import from
      • Dropbox
      • Google Drive
      • Gist
      • Clipboard
    • Export to
      • Dropbox
      • Google Drive
      • Gist
    • Download
      • Markdown
      • HTML
      • Raw HTML
Menu Note settings Versions and GitHub Sync Note Insights Sharing URL Create Help
Create Create new note Create a note from template
Menu
Options
Engagement control Transfer ownership Delete this note
Import from
Dropbox Google Drive Gist Clipboard
Export to
Dropbox Google Drive Gist
Download
Markdown HTML Raw HTML
Back
Sharing URL Link copied
/edit
View mode
  • Edit mode
  • View mode
  • Book mode
  • Slide mode
Edit mode View mode Book mode Slide mode
Customize slides
Note Permission
Read
Only me
  • Only me
  • Signed-in users
  • Everyone
Only me Signed-in users Everyone
Write
Only me
  • Only me
  • Signed-in users
  • Everyone
Only me Signed-in users Everyone
Engagement control Commenting, Suggest edit, Emoji Reply
  • Invite by email
    Invitee

    This note has no invitees

  • Publish Note

    Share your work with the world Congratulations! 🎉 Your note is out in the world Publish Note

    Your note will be visible on your profile and discoverable by anyone.
    Your note is now live.
    This note is visible on your profile and discoverable online.
    Everyone on the web can find and read all notes of this public team.
    See published notes
    Unpublish note
    Please check the box to agree to the Community Guidelines.
    View profile
    Engagement control
    Commenting
    Permission
    Disabled Forbidden Owners Signed-in users Everyone
    Enable
    Permission
    • Forbidden
    • Owners
    • Signed-in users
    • Everyone
    Suggest edit
    Permission
    Disabled Forbidden Owners Signed-in users Everyone
    Enable
    Permission
    • Forbidden
    • Owners
    • Signed-in users
    Emoji Reply
    Enable
    Import from Dropbox Google Drive Gist Clipboard
       owned this note    owned this note      
    Published Linked with GitHub
    Subscribed
    • Any changes
      Be notified of any changes
    • Mention me
      Be notified of mention me
    • Unsubscribe
    Subscribe
    # Natural merge sort 在特定硬體的加速 contributed by <`petermouse`>, <`heathcliffYang`>, <`janetwei`> 程式碼: [natural-mergesort](https://github.com/petermouse/natural-mergesort) [Youtube](https://youtu.be/NiAH4nSldcs) ## 目標 - 改良 Natural merge sort 的 code,注意 branch prediction 的影響 - 嘗試使用硬體提供的 clz 指令來改寫 ## 討論 ### `__builtin_clz` 與 `LZCNT` ? 我們看到 x86 有提供組語的指令 `LZCNT` 可以算 clz,但是手冊上提供的 C 版函式需要用 intel 的 compiler, gcc 並不提供,後來又發現給 gcc 的函式`__builtin_clz`,因此測試 gcc 的 compiler 在編譯的時候,到底會不會編成`LZCNT`? > 從 gcc 的參數`-g -Wa,-adhl` 可以對照組語 >> 用 `objdump -d` 觀察 [name=jserv] >> OK!謝謝老師! [name=heathcliffYang] ### `__builtin_clz` 與 `LZCNT` 測試 `objdump` 可以顯示有關於目的檔的資訊,其中的 `-d` 參數代表對檔案反組譯( disassemble ) 因此我寫了個簡單的程式測試 gcc 內建的 `__builtin_clz` 是否會變成 `LZCNT` ```c= #include <stdio.h> int main() { unsigned int x=100; int clz; clz = __builtin_clz(x); printf("x:%d clz(x):%d\n", x, clz); return 0; } ``` 編譯過後對檔案反組譯:因為很長只列一小段 ``` >0000000000400526 <main>: 400526: 55 push %rbp 400527: 48 89 e5 mov %rsp,%rbp 40052a: 48 83 ec 10 sub $0x10,%rsp 40052e: c7 45 f8 64 00 00 00 movl $0x64,-0x8(%rbp) 400535: 0f bd 45 f8 bsr -0x8(%rbp),%eax 400539: 83 f0 1f xor $0x1f,%eax 40053c: 89 45 fc mov %eax,-0x4(%rbp) 40053f: 8b 55 fc mov -0x4(%rbp),%edx 400542: 8b 45 f8 mov -0x8(%rbp),%eax 400545: 89 c6 mov %eax,%esi 400547: bf e4 05 40 00 mov $0x4005e4,%edi 40054c: b8 00 00 00 00 mov $0x0,%eax 400551: e8 aa fe ff ff callq 400400 <printf@plt> 400556: b8 00 00 00 00 mov $0x0,%eax 40055b: c9 leaveq 40055c: c3 retq 40055d: 0f 1f 00 nopl (%rax) ``` * 整份沒有任何關於`lzcnt`的指令, gcc 的 `__buildin_clz` 並沒有使用到 `lzcnt` * 可以注意到` 400535: 0f bd 45 f8 bsr -0x8(%rbp),%eax` * [bsr](http://x86.renejeschke.de/html/file_module_x86_id_20.html) 是 bit scan reverse 的縮寫,在 x86 指令集中代表找尋 most significant set bit (也就是第一個1)的位置,所以 clz 是根據 `bsr` 求得 #### 關於`lzcnt` 與 `bsr` 的差異 * `bsr`: >==Searches the source operand (second operand) for the most significant set bit (1 bit). If a most significant 1 bit is found, its bit index is stored in the destination operand (first operand)==. The source operand can be a register or a memory location; the destination operand is a register. The bit index is an unsigned offset from bit 0 of the source operand. ==If the content source operand is 0, the content of the destination operand is undefined==. * `lzcnt`: >==Counts the number of leading most significant zero bits in a source operand (second operand) returning the result into a destination (first operand).== LZCNT differs from BSR. For example, ==LZCNT will produce the operand size when the input operand is zero.== ==It should be noted that on processors that do not support LZCNT, the instruction byte encoding is executed as BSR.== In 64-bit mode 64-bit operand size requires REX.W=1. * `lzcnt` 的 flag affect: >ZF flag is set to 1 in case of zero output (most significant bit of the source is set), and to 0 otherwise, CF flag is set to 1 if input was zero and cleared otherwise. OF, SF, PF and AF flags are undefined. [SSE4的Wikipedia](https://en.wikipedia.org/wiki/SSE4) (註: `lzcnt` 雖然不屬於 SIMD 指令,卻是在與 SSE4 同時期引入的) >The result of lzcnt is ==31 minus the result of the bsr (bit scan reverse), except when the input is 0==. ==lzcnt produces a result of 32, while bsr produces an undefined result==(and sets the zero flag). The encoding of lzcnt is similar enough to bsr that if lzcnt is performed on a CPU not supporting it such as Intel CPU's prior to Haswell, it will perform the bsr operation instead of raising an invalid instruction error. 另外在很多[硬體指令或軟體提供的函式](https://en.wikipedia.org/wiki/Find_first_set#Hardware_support)(~~其實幾乎都是~~ 更正:部份intel的處理器),如果輸入為0都是未定義行為,不太懂硬體在發展上有什麼考量或困難,使得0沒辦法做處理呢? >> 因為硬體不會知道你預期的資料表示法 (data representation),像是 C 語言的 `uint32_t` 和 `uint64_t` 編譯為機械碼之後,就喪失了原始資訊,致使 `clz(0)` 的輸入無法確認是 16 個 0、32 個 0,還是 64 個 0 [name=jserv] >>老師我原本看錯,以為`bsr`與`lzcnt`都不行處理 0,後來發現只有`bsr`是 undefined behavior ,`lzcnt`會回傳input size,`lzcnt`的狀況好像跟老師說的不太一樣嗎?倒是軟體提供的幾乎沒辦法提供正確結果,也是類似老師所說的嗎?[name=petermouse] >> 不要隨便說「好像」,不要跟文組的人一樣「憑感覺」。這些指令都有明確定義,要先確定 gcc 的 `__builtin_clz` 到底用產生哪些指令 (去檢驗手冊和原始碼),然後再去找 Intel 指令說明,像是 [LZCNT](http://www.felixcloutier.com/x86/LZCNT.html) 和 [Bit Manipulation Instruction Sets](https://en.wikipedia.org/wiki/Bit_Manipulation_Instruction_Sets) [name=jserv] ### 改成 __builtin_clz 的測試 ![](https://i.imgur.com/jymCAYb.png) 在只有改`__builtin_clz`的狀況下,其實根本沒有改善的差別,因為他在natural-mergesort裡面佔的部份很少。 ### 分析程式熱點 用`perf record ./natural-mergesort` 在用 `perf report` 觀察結果 ``` Samples: 1K of event 'cycles:pp', Event count (approx.): 1488328678 Overhead Command Shared Object Symbol 31.95% natural-mergeso natural-mergesort [.] merge 14.49% natural-mergeso natural-mergesort [.] integer_cmp 13.41% natural-mergeso libc-2.23.so [.] _IO_vfscanf 6.45% natural-mergeso libc-2.23.so [.] vfprintf 2.66% natural-mergeso libc-2.23.so [.] 0x000000000014d45b 2.12% natural-mergeso libc-2.23.so [.] 0x000000000003b6da 2.03% natural-mergeso libc-2.23.so [.] __isoc99_fscanf 1.98% natural-mergeso libc-2.23.so [.] _IO_file_xsputn 1.72% natural-mergeso libc-2.23.so [.] 0x000000000003b6f0 1.47% natural-mergeso natural-mergesort [.] build_run_length_queue 1.24% natural-mergeso natural-mergesort [.] memcpy@plt 1.24% natural-mergeso natural-mergesort [.] main 1.08% natural-mergeso libc-2.23.so [.] 0x000000000014d2c3 1.08% natural-mergeso libc-2.23.so [.] 0x000000000014d45d 1.02% natural-mergeso libc-2.23.so [.] 0x000000000014d433 0.85% natural-mergeso libc-2.23.so [.] _IO_sputbackc 0.85% natural-mergeso libc-2.23.so [.] strchrnul 0.79% natural-mergeso natural-mergesort [.] run_length_queue_dequeue 0.74% natural-mergeso natural-mergesort [.] run_length_queue_enqueue 0.74% natural-mergeso libc-2.23.so [.] 0x000000000003b6d5 ``` 可以發現到 `merge` 是最花時間的地方,再來是 comparator - [ ] merge() **1. 連續小於的偵測** 在這個函式中,是逐個 element ,也就是以 integer 為單位比較,並且做 memcpy 。但是這會叫用很多次 memcpy ,所以可以試著讓 copy 的區塊大一點。 **2. Segment 分級** 畢竟 merge 要用到的比較很多,小到一定程度的 segment 就不必一定要用到 merge 來做,而可以用簡單的 qsort 來代替 >quick sort是不穩定排序 merge sort是穩定排序,如果需要穩定性替換就不能用quick sort了 [name=petermouse] testcase : random 1000000 10000 ```c int conti_times_right = 0, conti_times_left = 0; while (left + conti_times_left < left_bound && right + conti_times_right < right_bound) { if (cmp(((char*) source) + size * (right + conti_times_right), ((char*) source) + size * (left + conti_times_left)) < 0) { //cmp() means? //A: user-defined comparasion function if (conti_times_left) { memcpy(((char*) target) + size * target_index, ((char*) source) + size * left, size * conti_times_left); left += conti_times_left; target_index += conti_times_left; conti_times_left = 0; } conti_times_right++; } else { if (conti_times_right) { memcpy(((char*) target) + size * target_index, ((char*) source) + size * right, size * conti_times_right); right += conti_times_right; target_index += conti_times_right; conti_times_right = 0; } conti_times_left++; } } if (conti_times_left) { memcpy(((char*) target) + size * target_index, ((char*) source) + size * left, size * conti_times_left); left += conti_times_left; target_index += conti_times_left; } else if (conti_times_right) { memcpy(((char*) target) + size * target_index, ((char*) source) + size * right, size*conti_times_right); right += conti_times_right; target_index += conti_times_right; } ``` >> TODO: 參考 [Timsort](https://en.wikipedia.org/wiki/Timsort),思考混合兩種以上 stable sorting algorithm 的機會。可透過 C macro 來建立 sorting template,參見 [sort](https://github.com/swenson/sort) [name=jserv] --- ### call graph - ori testcase - random - 1000000 ![](https://i.imgur.com/MxwqeqL.png) testcase - random - 10000000 ![](https://i.imgur.com/Ubs9yGY.png) - opt testcase - random - 10000000 ![](https://i.imgur.com/oVyylGZ.png) **inline cmp** 因為cmp感覺是必要叫這麼多次的函式,目前也想不出其他可以減少的方法,所以選擇把 merge 部份的 cmp 做 inline ```c inline __attribute__((always_inline)) int cmp(const void *a,const void *b) { return *(int *)a - *(int *)b; } ``` ![](https://i.imgur.com/rW0CpER.png) ### 建立測試資料 為了分析不同的測試資料會如何影響結果,所以建立了四種資料的形式 1. random : 資料完全隨機 2. sorted : 資料已經排序好(由小到大) 3. nearly-sorted : 資料呈現部份排序的樣子 4. reversed : 資料已經排序好(由大到小) ### 到目前為止所做的優化結果: 在不同種資料類型的mergesort時間,以及優化結果 資料為一百萬, range 為 0 到 50000000 ![](https://i.imgur.com/Zp4upty.png) 以下為在資料類型是 random number 時在不同 range 下取1000個資料所得到的時間以及優化結果 ![](https://i.imgur.com/0e7V7S4.png) 在相同資料數量下, random number 在不同 range 下排列的時間是相近的,優化過後的結果更是趨於穩定 ### branch predictor 記得在[Modern Microprocessors](http://www.lighterra.com/papers/modernmicroprocessors/) 提到一個小概念: ```clike if (a > 7) { b = c; } else { b = d; } ``` * 原始組語 ``` cmp a, 7 ; a > 7 ? ble L1 mov c, b ; b = c br L2 L1: mov d, b ; b = d L2: ... ``` * Eliminating Branches with Predication ``` cmp a, 7 ; a > 7 ? mov c, b ; b = c cmovle d, b ; if le, then b = d ``` ==不先判斷,而是先做如果式子成立的結果,再來判斷如果式子不成立的話要 override== ---> 試試看能不能把這種概念應用上去以減少判斷 且這個概念在[第18題 branch predictor](https://embedded2016.hackpad.com/2016q1-Week-2--Sc7AmIvN7EN#:h=%E7%AC%AC18%E9%A1%8C:-branch-predictor)中有類似的 ```c while (i1 < n && i2 < n) { int v1 = src1[i1]; int v2 = src2[i2]; int take1 = v1 < v2; dest[id++] = take1 ? v1 : v2; i1 += take1; i2 += (1 - take1); } ``` 但是要注意的是,merge 這個動作總共有三種寫法:origin, teacher version, student version 而詳細的探討要考慮到 ==gcc 的最佳化==,以及 CPU 架構(不確定能不能探討到) >> 實際的結果呢?如何設計實驗? [name=jserv] * gcc的參數 `-fno-guess-branch-probability` Do not guess branch probabilities using heuristics. `-O0`表示無最佳化 `-O1`到`-O3`表示最佳化不同的程度 [參考資料](https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html) ## 兩份參考文件的閱讀 ### 第一篇 :**uop cache** ## [Segmented Sort and Locality Sort](https://nvlabs.github.io/moderngpu/segsort.html) 我們看了幾次這份文件,本來以為 segmented sort 跟 locality sort 也是兩種 sort 的方法 ( 演算法或處理方法 ) ,但是發覺其實不是,這兩個 sort 是表現我們在 programming 時可能會面臨到的問題。 Segmented sort 與 locality sort : merge sort 的變形,使用在分配不平均的亂數資料上( non-uniform random data ) ### 1. Segmented Sort Segmented sort allows us to sort many variable-length arrays in parallel. ->將 sort 分成幾個 block 同時排列 ![](https://i.imgur.com/WxTmzUQ.png) 圖片顯示如果平均 segment (資料長度)在較小的情況,平行處理時有較高的效能,反之如果 segment 太大,反而比一般 mergesort 效能低 * Segmented sort pairs 一組陣列的值對應到一個位置, sort 之後的陣列數值排列,因此也要將一開始的位置對應到 sort 後的值 >> 這是我們對 code 的理解,但仍不知道會應用在哪 >> >>source code 的連結掉了,不知道裡面是否有應用的內容,而且github提供fork的repository找不到對應的檔案 >>[name=魏孜昀] * Algorithm 解釋 : 文章後半段有討論到 segmented 想呈現的問題 一個單一的問題很好解決,但是當我們面臨到的是小區小區破碎的問題時,就要有不同手段的考量,就像當我們在 partition 不同的資料區間之後,不可能不論長度大小都用要叫多個 thread 來 sort ,因為 CP 值不符。 ==未完== ### 2. Locality Sort ~~Locality sort : 可以偵測接近排序的資料範圍~~ ![](https://i.imgur.com/KcSNXqW.png) 這是關於 locality sort 的 benchmark,如果 noise amplitude 愈大,效能表現會遇差 noise amplitude 指的其實是資料上的浮動範圍,浮動範圍愈大會使得 locality sort 的效能降低 如果我們觀察 demo 的 input keys ```clike= for(int i = 0; i < N; ++i) keysHost[i] = i + Rand(0, 25); ``` 呈現的資料: ``` Input keys: 0: 15 26 16 9 26 27 12 16 16 28 10: 13 32 36 18 30 40 28 35 34 44 20: 34 40 38 28 38 34 44 32 41 50 30: 55 55 37 52 36 57 38 48 39 47 40: 50 62 53 57 53 48 65 52 64 61 50: 70 61 76 72 79 64 60 77 61 84 60: 78 83 64 84 77 74 79 68 90 94 70: 82 92 82 95 91 76 95 77 91 94 80: 89 100 85 99 99 102 92 111 89 95 90: 109 114 98 96 105 103 113 119 107 105 ``` 這個 demo 程式讓資料呈現"有點排序"的樣子,noise amplitude 就是因為 `Rand(0,25)` 所造成資料上的浮動程度。 ### 第二篇 :[Mergesort](https://nvlabs.github.io/moderngpu/mergesort.html) 1. Mergesort is type of ==multi-pass vectorized merge==: the first iteration executes N / 2 merges with inputs of length 1; the second iteration executes N / 4 merges with inputs of length 2; etc. 2. The number of batched merge passes is log(N) and the work per pass is N. This O(N log N) work-efficiency hurts mergesort's scalability compared to radix sort. **number of batched merge passes :** 呼叫下一層 function **work per pass :** 要處理的 list 的長度 所以 O(N log N) 比較 radix sort 的 O(variable x (N + 10)) ,當 N 變大,也就是 scalability 變大,mergesort 所受的影響會比較大 >> 上面整段數學式要說明什麼? 請重新整理式子並說明[color=red][name=課程助教] >> 這個是當初在討論的計算式,忘記刪掉 >> [mgpuhost.cuh:163](https://github.com/ankan-ban/perft_gpu/blob/master/moderngpu-master/include/mgpuhost.cuh) 有提到 The data is sorted in-place. 顛覆我們一般對 mergesort 的認知,不知道是怎麼辦到的,有待確認 [name=petermouse] >>> in-place : 不代表完全不使用額外空間,而是使用小數量或常數數量的額外空間 ( i.e. 空間複雜度 O(1)) 3. This mergesort implementation runs at about half the throughput on large arrays with 32-bit keys as the fastest GPU radix sorts. 4. If the sort is truly on a critical path, it may be worth pulling a radix sort from B40C/FastSortSm20. >> 不太懂第三句的 as the fastest GPU radix sorts 意思 >> 以及第四句的 a critical path 是什麼 >> **CTA-level blocksorts** ?? [name=heathcliffYang] >>> [NVIDIA Compute](https://www.doc.ic.ac.uk/~wl/teachlocal/arch2/papers/nvidia-PTX_ISA_1.0.pdf) 中的 CTA 解釋 : A Cooperative Thread Array, or CTA, is an array of threads that execute a kernel concurrently or in parallel. >>> [CUDPP library](http://cudpp.github.io/cudpp/2.1/dir_c96d66e6db050f7b70147ba9bf6583c2.html) 有一系列的 sort 函式[name=petermouse] >>> [blocksort kernel](https://github.com/OrangeOwlSolutions/CUB/blob/master/BlockSort.cu) >> ### Advantages of mergesort against radix sort 1. mergesort is comparison sort-> 支援多種 type 的 input ,而 radix sort 限制 input 在整數和小數 2. 數字大小不會影響 mergesort 時間複雜度,而數字大小會影響 radix 有幾個 bit 的多寡 3. 提供 CTA-level blocksort 4. 面對已經排序好的 input , mergesort 可以只是單純地將 data 複製,而其只需執行 log(N) ,radix sort 沒有檢查的機制 --- #### 補充 : radix sort 1. 將數值切割進行排序 ---> 92 變成 9 , 2 2. 以 base 來決定要分幾類 ---> 例如要排序的數值是10進位,那 base 就是10,依照單個位數大小分成0~9個類 3. 有幾個位數分幾類,從左到右與從右到左又是兩種方法 4. 最想要分析的 O(位數 x (數列長度 + base )) [參考資料](http://notepad.yehyeh.net/Content/Algorithm/Sort/Radix/Radix.php) --- 5. Mergesort on GPU runs best when written in two distinct stages: a) A blocksort kernel sorts random inputs into tile-length sorted lists, communicating with low-latency, high-bandwidth shared memory .The CTA blocksort forms a convenient re-usable component for MGPU's customers. b) Multiple launches of MGPU Merge iteratively merge sorted lists, starting with the output of 1, communicating between passes with high-latency, high-capacity DRAM. 兩個都用到以下方法: Pairs of threads (or CTAs for the global merge passes) cooperatively merge two VT-length lists (or two NV-length lists) into one list. ---> 覺得很神奇的地方是,居然是 " 一對 " 甚至更多個 thread 來一起處理 merge two list into one 的事 範例: ![](https://i.imgur.com/CkAmycz.png) 圖中每個正方形的上邊與右邊代表的是**已經排序好的資料** 對第一個正方形來說: thread 1 與 thread 2 使用這兩個排序好的資料,排序到下圖中第一個正方形的**上邊** 對第二個正方形來說: thread 3 與 thread 4 使用這兩個排序好的資料,排序到下圖中第一個正方形的**右邊** 接著,下圖 thread 1、2、3、4 會將第一個正方形的上邊與右邊,排序到更大一個正方形的上邊 ![](https://i.imgur.com/kSWYWAs.png) thread 之間並行排序的細節: ![](https://i.imgur.com/4sI43St.png) 在之前的圖中可以看到有條綠色的線從左上向下或向右走到右下,這條線稱作 Merge Path ,它由這兩筆排序好的資料的互相比對來呈現這兩筆資料的關係 ![](https://i.imgur.com/PU36pEU.png) 接著對這張圖使用斜線切割成相等的大小,可以對照到x軸與y軸間,每個斜線之間的資料數目是一樣的 ![](https://i.imgur.com/iL1EGE7.png) 因此根據這個斜線切到的位置 diag 以及 Merge path,可以決定斜線之間的資料應該要放到哪個位置,比如說,從這張圖來看, a0 與 a1 之間的資料,以及 b0 與 b1 之間的資料,可以確定他們將會排在 diag0 與 diag1 之間,同理 a3 與 a4 之間的資料與 b3 與 b4 之間的資料將會排到 diag3 與 diag4 之間,有了區間的資訊,每個 thread 可以取得獨立的資料,排在各自的位置上,使得 thread 可以平行處理 merge 實際上在找尋斜線之間的區間(就是那個 a b 區間)時,並不是真的建立 Merge Path (不然直接 merge 就好啦!),而是用 binary search 來找尋這個區間 >不太清楚 binary search 實際要怎麼應用在這裡 [name=petermouse] --- ### sorting networks blocksort 在每個 CTA 中使用 NT 個 threads ,每個 thread 有 VT 個值, 在 merge 的前幾個回合可以使用 in-register 的 sort 方式 , 分為 1. Batcher's odd-even mergesort (O(n log^2^ n)) 2. odd-even transposition sort (O(n^2)) 兩種在平行處理時有很好的效果,因為沒有使用到 shared memory odd-even transposition sort 雖然比較慢,但是具備穩定性,所以推薦使用 ### blocksort CTAMergesort : 是一個 block-level mergesort 1. OddEvenTransposeSort : 先做簡單的 local sort 2. DeviceThreadToShared : 將 local sorted keys 放入 shared memory 3. DeviceBlocksortLoop : 將 list merge --- --- 6. Note that ==the keys and indices are unioned in shared memory, so as to not waste resources==. This is an idiom used throughput Modern GPU, and an important one to follow if you goal is high throughput. ---> 這個概念在下一章節會被用到 7. Segmented and locality sorts exploit the sortedness of inputs to reduce processes. ---> 比起 mergesort 的較為粗略的 partitioning ,Segmented sort 跟 locality sort 可以找到 " 已排序好的 " input 去減少執行過程 8. Mergesort is a multi-pass, out-of-place algorithm. The blocksort reduces global memory traffic by **sorting blocks of NV elements locally, performing key exchange through low-latency shared memory.** ---> 兩個方法減少資料溝通所需(因為 global 需要被所有 thread 看見) >> 但是不是很了解實際執行的方法 >> 這個要順便複習其他組別研究 prefetch 和計算機組織結構中的 cacheline 對效能的影響 [name=jserv] >> 跟 hugikun999 討論到可以試著用 AVX + 配上用 clz 當 key 來增加一次可以判斷的數值數量,但還沒開始實作 ## out of our dictionary > vanilla mergesort:normal mergesort ### 參考資料 * [intel 64 software developer instruction set reference manual ](http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-instruction-set-reference-manual-325383.pdf) * [gcc X86-Built-in-Functions](https://gcc.gnu.org/onlinedocs/gcc-4.8.4/gcc/X86-Built-in-Functions.html) * [gcc implementation of builtin clz](http://stackoverflow.com/questions/9353973/implementation-of-builtin-clz) * [gcc 查組語對照](http://leavinel.blogspot.tw/2012/06/gcc.html) * [gcc 內嵌組語](http://ccckmit.wikidot.com/as:inlinec) * [how undefined are builtin ctz0](http://stackoverflow.com/questions/19527897/how-undefined-are-builtin-ctz0-or-builtin-clz0) * [GNU 的連結工具 (ld, nm, objdump, ar)](http://sp1.wikidot.com/gnulinker) * [objdump](https://sourceware.org/binutils/docs/binutils/objdump.html) * [NVlabs Cub](https://nvlabs.github.io/cub/) * [moderngpu intro](https://nvlabs.github.io/moderngpu/intro.html) * [optimizing merge sort](http://jeffreystedfast.blogspot.tw/2011/04/optimizing-merge-sort.html) * [An Experimental Study of Sorting and Branch Prediction](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.157.9149&rep=rep1&type=pdf)

    Import from clipboard

    Paste your markdown or webpage here...

    Advanced permission required

    Your current role can only read. Ask the system administrator to acquire write and comment permission.

    This team is disabled

    Sorry, this team is disabled. You can't edit this note.

    This note is locked

    Sorry, only owner can edit this note.

    Reach the limit

    Sorry, you've reached the max length this note can be.
    Please reduce the content or divide it to more notes, thank you!

    Import from Gist

    Import from Snippet

    or

    Export to Snippet

    Are you sure?

    Do you really want to delete this note?
    All users will lose their connection.

    Create a note from template

    Create a note from template

    Oops...
    This template has been removed or transferred.
    Upgrade
    All
    • All
    • Team
    No template.

    Create a template

    Upgrade

    Delete template

    Do you really want to delete this template?
    Turn this template into a regular note and keep its content, versions, and comments.

    This page need refresh

    You have an incompatible client version.
    Refresh to update.
    New version available!
    See releases notes here
    Refresh to enjoy new features.
    Your user state has changed.
    Refresh to load new user state.

    Sign in

    Forgot password

    or

    By clicking below, you agree to our terms of service.

    Sign in via Facebook Sign in via Twitter Sign in via GitHub Sign in via Dropbox Sign in with Wallet
    Wallet ( )
    Connect another wallet

    New to HackMD? Sign up

    Help

    • English
    • 中文
    • Français
    • Deutsch
    • 日本語
    • Español
    • Català
    • Ελληνικά
    • Português
    • italiano
    • Türkçe
    • Русский
    • Nederlands
    • hrvatski jezik
    • język polski
    • Українська
    • हिन्दी
    • svenska
    • Esperanto
    • dansk

    Documents

    Help & Tutorial

    How to use Book mode

    Slide Example

    API Docs

    Edit in VSCode

    Install browser extension

    Contacts

    Feedback

    Discord

    Send us email

    Resources

    Releases

    Pricing

    Blog

    Policy

    Terms

    Privacy

    Cheatsheet

    Syntax Example Reference
    # Header Header 基本排版
    - Unordered List
    • Unordered List
    1. Ordered List
    1. Ordered List
    - [ ] Todo List
    • Todo List
    > Blockquote
    Blockquote
    **Bold font** Bold font
    *Italics font* Italics font
    ~~Strikethrough~~ Strikethrough
    19^th^ 19th
    H~2~O H2O
    ++Inserted text++ Inserted text
    ==Marked text== Marked text
    [link text](https:// "title") Link
    ![image alt](https:// "title") Image
    `Code` Code 在筆記中貼入程式碼
    ```javascript
    var i = 0;
    ```
    var i = 0;
    :smile: :smile: Emoji list
    {%youtube youtube_id %} Externals
    $L^aT_eX$ LaTeX
    :::info
    This is a alert area.
    :::

    This is a alert area.

    Versions and GitHub Sync
    Get Full History Access

    • Edit version name
    • Delete

    revision author avatar     named on  

    More Less

    Note content is identical to the latest version.
    Compare
      Choose a version
      No search result
      Version not found
    Sign in to link this note to GitHub
    Learn more
    This note is not linked with GitHub
     

    Feedback

    Submission failed, please try again

    Thanks for your support.

    On a scale of 0-10, how likely is it that you would recommend HackMD to your friends, family or business associates?

    Please give us some advice and help us improve HackMD.

     

    Thanks for your feedback

    Remove version name

    Do you want to remove this version name and description?

    Transfer ownership

    Transfer to
      Warning: is a public team. If you transfer note to this team, everyone on the web can find and read this note.

        Link with GitHub

        Please authorize HackMD on GitHub
        • Please sign in to GitHub and install the HackMD app on your GitHub repo.
        • HackMD links with GitHub through a GitHub App. You can choose which repo to install our App.
        Learn more  Sign in to GitHub

        Push the note to GitHub Push to GitHub Pull a file from GitHub

          Authorize again
         

        Choose which file to push to

        Select repo
        Refresh Authorize more repos
        Select branch
        Select file
        Select branch
        Choose version(s) to push
        • Save a new version and push
        • Choose from existing versions
        Include title and tags
        Available push count

        Pull from GitHub

         
        File from GitHub
        File from HackMD

        GitHub Link Settings

        File linked

        Linked by
        File path
        Last synced branch
        Available push count

        Danger Zone

        Unlink
        You will no longer receive notification when GitHub file changes after unlink.

        Syncing

        Push failed

        Push successfully