Hsin Hung Lin
    • Create new note
    • Create a note from template
      • Sharing URL Link copied
      • /edit
      • View mode
        • Edit mode
        • View mode
        • Book mode
        • Slide mode
        Edit mode View mode Book mode Slide mode
      • Customize slides
      • Note Permission
      • Read
        • Only me
        • Signed-in users
        • Everyone
        Only me Signed-in users Everyone
      • Write
        • Only me
        • Signed-in users
        • Everyone
        Only me Signed-in users Everyone
      • Engagement control Commenting, Suggest edit, Emoji Reply
    • Invite by email
      Invitee

      This note has no invitees

    • Publish Note

      Share your work with the world Congratulations! 🎉 Your note is out in the world Publish Note

      Your note will be visible on your profile and discoverable by anyone.
      Your note is now live.
      This note is visible on your profile and discoverable online.
      Everyone on the web can find and read all notes of this public team.
      See published notes
      Unpublish note
      Please check the box to agree to the Community Guidelines.
      View profile
    • Commenting
      Permission
      Disabled Forbidden Owners Signed-in users Everyone
    • Enable
    • Permission
      • Forbidden
      • Owners
      • Signed-in users
      • Everyone
    • Suggest edit
      Permission
      Disabled Forbidden Owners Signed-in users Everyone
    • Enable
    • Permission
      • Forbidden
      • Owners
      • Signed-in users
    • Emoji Reply
    • Enable
    • Versions and GitHub Sync
    • Note settings
    • Note Insights
    • Engagement control
    • Transfer ownership
    • Delete this note
    • Save as template
    • Insert from template
    • Import from
      • Dropbox
      • Google Drive
      • Gist
      • Clipboard
    • Export to
      • Dropbox
      • Google Drive
      • Gist
    • Download
      • Markdown
      • HTML
      • Raw HTML
Menu Note settings Versions and GitHub Sync Note Insights Sharing URL Create Help
Create Create new note Create a note from template
Menu
Options
Engagement control Transfer ownership Delete this note
Import from
Dropbox Google Drive Gist Clipboard
Export to
Dropbox Google Drive Gist
Download
Markdown HTML Raw HTML
Back
Sharing URL Link copied
/edit
View mode
  • Edit mode
  • View mode
  • Book mode
  • Slide mode
Edit mode View mode Book mode Slide mode
Customize slides
Note Permission
Read
Only me
  • Only me
  • Signed-in users
  • Everyone
Only me Signed-in users Everyone
Write
Only me
  • Only me
  • Signed-in users
  • Everyone
Only me Signed-in users Everyone
Engagement control Commenting, Suggest edit, Emoji Reply
  • Invite by email
    Invitee

    This note has no invitees

  • Publish Note

    Share your work with the world Congratulations! 🎉 Your note is out in the world Publish Note

    Your note will be visible on your profile and discoverable by anyone.
    Your note is now live.
    This note is visible on your profile and discoverable online.
    Everyone on the web can find and read all notes of this public team.
    See published notes
    Unpublish note
    Please check the box to agree to the Community Guidelines.
    View profile
    Engagement control
    Commenting
    Permission
    Disabled Forbidden Owners Signed-in users Everyone
    Enable
    Permission
    • Forbidden
    • Owners
    • Signed-in users
    • Everyone
    Suggest edit
    Permission
    Disabled Forbidden Owners Signed-in users Everyone
    Enable
    Permission
    • Forbidden
    • Owners
    • Signed-in users
    Emoji Reply
    Enable
    Import from Dropbox Google Drive Gist Clipboard
       owned this note    owned this note      
    Published Linked with GitHub
    Subscribed
    • Any changes
      Be notified of any changes
    • Mention me
      Be notified of mention me
    • Unsubscribe
    Subscribe
    # 2016q3 Homework1 (raytracing) contributed by <`TotallyWrong`> ###### tags: `TotallyWrong` `On going` ### ::: Reviewd by <`andy19950`> - 使用很多種方法來嘗試修改 dot_product 很好,但是這份程式碼還有很多修改後可以增加效能的地方,可以詳讀程式碼後嘗試看看其他部份的優化。 - 有把程式碼轉成組合語言來做判斷,如果能夠了解組語的變動的原因後簡單的告訴我們 "為何這樣會造成執行時間增加" 會更好。 - 這份報告比較欠缺圖表分析的方式來顯示不同方式之間的差異,比較不直觀一點。 - 最後就是這個部份的 github 連結是錯的,記得把它修改過來。 --- ## 開發作業環境 **CPU** : Intel Core i5-5200U **Cache size**: L1 Cache 128KB, L2 Cache 512K, L3 Cache 3072KB **Operating System** : Ubuntu 15.10 Wily Werewolf --- **Feature**: * MMX instructions * SSE / Streaming SIMD Extensions * SSE2 / Streaming SIMD Extensions 2 * SSE3 / Streaming SIMD Extensions 3 * SSSE3 / Supplemental Streaming SIMD Extensions 3 * SSE4 / SSE4.1 + SSE4.2 / Streaming SIMD Extensions 4 ? * AES / Advanced Encryption Standard instructions * AVX / Advanced Vector Extensions * AVX2 / Advanced Vector Extensions 2.0 --- ## 使用 Ubuntu 15.10 ### 安狀相關開發工具 ``` $ sudo apt-get update $ sudo apt-get install graphviz $ sudo apt-get install imagemagick ``` --- ### 程式測試 ![](https://i.imgur.com/9rfKh5V.jpg) ---- 跑完第一次Test Run 結果 ``` Each sample counts as 0.01 seconds. % cumulative self self total time seconds seconds calls s/call s/call name 27.18 0.69 0.69 69646433 0.00 0.00 dot_product 21.27 1.23 0.54 56956357 0.00 0.00 subtract_vector 9.45 1.47 0.24 13861875 0.00 0.00 rayRectangularIntersection 7.48 1.66 0.19 31410180 0.00 0.00 multiply_vector 6.70 1.83 0.17 13861875 0.00 0.00 raySphereIntersection 4.73 1.95 0.12 17836094 0.00 0.00 add_vector 4.73 2.07 0.12 10598450 0.00 0.00 normalize 4.33 2.18 0.11 17821809 0.00 0.00 cross_product 3.94 2.28 0.10 4620625 0.00 0.00 ray_hit_object 1.97 2.33 0.05 1048576 0.00 0.00 ray_color 1.58 2.37 0.04 4221152 0.00 0.00 multiply_vectors 1.18 2.40 0.03 2520791 0.00 0.00 idx_stack_top 1.18 2.43 0.03 2110576 0.00 0.00 localColor 1.18 2.46 0.03 1 0.03 2.53 raytracing 0.79 2.48 0.02 2110576 0.00 0.00 compute_specular_diffuse 0.79 2.50 0.02 1048576 0.00 0.00 rayConstruction 0.39 2.51 0.01 3838091 0.00 0.00 length 0.39 2.52 0.01 1241598 0.00 0.00 refraction 0.39 2.53 0.01 1204003 0.00 0.00 idx_stack_push ``` --- 原始檔案跑完時的gprof_out.txt 結果顯示前幾個就如同老師的影片中顯示 dot_product function 被叫最多次。 ![](https://i.imgur.com/BnMfm7L.png) ![](https://i.imgur.com/nTwrlmL.png) 而利用Perf跑cache-misses就算清空cache,cache- miss的比例還是有時多有時少,但是時間上則大致差不多。 --- 再閱讀[rnic](https://embedded2016.hackpad.com/2016q1-Homework-2A-GalzL151aZc)學長的筆記,我也嘗試了建立利gprof2dot建立CallGraph![](https://i.imgur.com/qOPcUWh.png)來使用對了解來說很好用。 --- 使用gprof2dot前請先裝 python ``` sudo apt-get install python graphviz ``` 使用 Xdot 請先安裝 ``` sudo apt-get install xdot ``` --- ### 嘗試改善方法 ### 方法一: ### 嘗依如老師再影片中所提到的強制compiler進行inline的作法,在Google搜尋後找到的[語法](https://gcc.gnu.org/onlinedocs/gcc/Inline.html) ![](https://i.imgur.com/yWrycQd.png) --- 結果大約有0.1秒的改進,但似乎點難已確認因為測試多次結果都有浮動,在改善很小時難已確認。 >有發現似乎電腦再執行他程式的話對cache test 的時間結果有影響 ### 方法二: ### 接下來嘗試把幾個主要的記算函式的回圈展開,這個作法主要是為了減少branch所造成的Control Harzard。我把最常見的前3個Function dot_product,subtract_vector, multiply_vector都Urolling 。 --- ``` Performance counter stats for './raytracing' (50 runs): 122,126 cache-misses # 22.979 % of all cache refs 552,830 cache-references 11,826,508,383 instructions # 2.25 insns per cycle 5,238,499,149 cycles 1.969861040 seconds time elapsed ( +- 0.19% ) ``` ![](https://i.imgur.com/IX8DpNj.png) 在加總方法一和方法二整體約改善一秒鐘。 --- ### 方法三 : ### 接下來就繼續嘗試使用SIMD,雖然在影片中似乎有人嘗試過而效果並不好,但還是需要了解一下使用方法而且測試一下。[**AVX**](https://software.intel.com/en-us/articles/using-avx-without-writing-avx-code) 在讀過相關資料後了解大致上,Scalar 就是拿一個比較大的記憶體來裝數個較小的值來進行同時運算, 以下兩個網頁是讓我了解SIMD和如何在GCC中使用這種記憶體。 --- 第3個網頁是搜尋到有人對dot_product 的研究,似乎AVX在相當大的Vector的dot_product還是有幫助。 [**什麼是SIMD,**](https://www.kernel.org/pub/linux/kernel/people/geoff/cell/ps3-linux-docs/CellProgrammingTutorial/BasicsOfSIMDProgramming.html) [**怎麼在GCC上Code_Intel_SIMD,**](http://www.codeproject.com/Articles/874396/Crunching-Numbers-with-AVX-and-AVX) [**Dot Product 的改善**](http://blog.theincredibleholk.org/blog/2012/12/10/optimizing-dot-product/) ![](https://i.imgur.com/y6MyamJ.png) --- Dot Product 的程式碼被改成如下 ```clike= inline double dot_product(const double *v1, const double *v2) __attribute__((always_inline)); double dot_product(const double *v1, const double *v2) { double dp=0.0; /* Initialize AVX vector */ __m256d v1vector = _mm256_set_pd(v1[0],v1[1],v1[2],0); __m256d v2vector = _mm256_set_pd(v2[0],v2[1],v2[2],0); /* Compute the product between the two vectors */ __m256d result = _mm256_mul_pd(v1vector,v2vector); /* Sum the elements of the result vector */ double* f = (double*)&result ; dp= f[0]+f[1]+f[2]+f[3] ; return dp; } ``` --- 經過測試時間變長了 ``` #Rendering scene Done! Execution time of raytracing() : 4.320740 sec ``` ![](https://i.imgur.com/ggLmDqm.png) 而檢查Assembly Code 似乎建構SIMD的Assembly Code很長跟原始資表雖然幾乎沒有使用到Jmp的code,不知道這是否可以解釋為使用這種資料結構的成本不小。 ``` //static inline inline double dot_product(const double *v1, const double *v2) __attribute__((always_inline)); double dot_product(const double *v1, const double *v2) { 3f4: 4c 8d 54 24 08 lea r10,[rsp+0x8] 3f9: 48 83 e4 e0 and rsp,0xffffffffffffffe0 3fd: 41 ff 72 f8 push QWORD PTR [r10-0x8] 401: 55 push rbp 402: 48 89 e5 mov rbp,rsp 405: 41 52 push r10 407: 48 81 ec 28 01 00 00 sub rsp,0x128 40e: 48 89 bd d8 fe ff ff mov QWORD PTR [rbp-0x128],rdi 415: 48 89 b5 d0 fe ff ff mov QWORD PTR [rbp-0x130],rsi 41c: 64 48 8b 04 25 28 00 mov rax,QWORD PTR fs:0x28 423: 00 00 425: 48 89 45 e8 mov QWORD PTR [rbp-0x18],rax 429: 31 c0 xor eax,eax double dp=0.0; 42b: c5 f9 57 c0 vxorpd xmm0,xmm0,xmm0 42f: c5 fb 11 85 e0 fe ff vmovsd QWORD PTR [rbp-0x120],xmm0 436: ff /* Initialize AVX vector */ __m256d v1vector = _mm256_set_pd(v1[0],v1[1],v1[2],0); 437: 48 8b 85 d8 fe ff ff mov rax,QWORD PTR [rbp-0x128] 43e: 48 83 c0 10 add rax,0x10 442: c5 fb 10 00 vmovsd xmm0,QWORD PTR [rax] 446: 48 8b 85 d8 fe ff ff mov rax,QWORD PTR [rbp-0x128] 44d: 48 83 c0 08 add rax,0x8 451: c5 fb 10 08 vmovsd xmm1,QWORD PTR [rax] 455: 48 8b 85 d8 fe ff ff mov rax,QWORD PTR [rbp-0x128] 45c: c5 fb 10 10 vmovsd xmm2,QWORD PTR [rax] 460: c5 fb 11 95 e8 fe ff vmovsd QWORD PTR [rbp-0x118],xmm2 467: ff 468: c5 fb 11 8d 18 ff ff vmovsd QWORD PTR [rbp-0xe8],xmm1 46f: ff 470: c5 fb 11 85 20 ff ff vmovsd QWORD PTR [rbp-0xe0],xmm0 477: ff 478: c5 f9 57 c0 vxorpd xmm0,xmm0,xmm0 47c: c5 fb 11 85 28 ff ff vmovsd QWORD PTR [rbp-0xd8],xmm0 483: ff /* Create the vector [A B C D]. */ extern __inline __m256d __attribute__((__gnu_inline__, __always_inline__, __artificial__)) _mm256_set_pd (double __A, double __B, double __C, double __D) { return __extension__ (__m256d){ __D, __C, __B, __A }; 484: c5 fb 10 85 e8 fe ff vmovsd xmm0,QWORD PTR [rbp-0x118] 48b: ff 48c: c5 fb 10 8d 18 ff ff vmovsd xmm1,QWORD PTR [rbp-0xe8] 493: ff 494: c5 f1 14 c8 vunpcklpd xmm1,xmm1,xmm0 498: c5 fb 10 85 20 ff ff vmovsd xmm0,QWORD PTR [rbp-0xe0] 49f: ff 4a0: c5 fb 10 95 28 ff ff vmovsd xmm2,QWORD PTR [rbp-0xd8] 4a7: ff 4a8: c5 e9 14 c0 vunpcklpd xmm0,xmm2,xmm0 4ac: c4 e3 7d 18 c1 01 vinsertf128 ymm0,ymm0,xmm1,0x1 4b2: c5 fd 29 85 50 ff ff vmovapd YMMWORD PTR [rbp-0xb0],ymm0 4b9: ff __m256d v2vector = _mm256_set_pd(v2[0],v2[1],v2[2],0); 4ba: 48 8b 85 d0 fe ff ff mov rax,QWORD PTR [rbp-0x130] 4c1: 48 83 c0 10 add rax,0x10 4c5: c5 fb 10 00 vmovsd xmm0,QWORD PTR [rax] 4c9: 48 8b 85 d0 fe ff ff mov rax,QWORD PTR [rbp-0x130] 4d0: 48 83 c0 08 add rax,0x8 4d4: c5 fb 10 08 vmovsd xmm1,QWORD PTR [rax] 4d8: 48 8b 85 d0 fe ff ff mov rax,QWORD PTR [rbp-0x130] 4df: c5 fb 10 10 vmovsd xmm2,QWORD PTR [rax] 4e3: c5 fb 11 95 f0 fe ff vmovsd QWORD PTR [rbp-0x110],xmm2 4ea: ff 4eb: c5 fb 11 8d 00 ff ff vmovsd QWORD PTR [rbp-0x100],xmm1 4f2: ff 4f3: c5 fb 11 85 08 ff ff vmovsd QWORD PTR [rbp-0xf8],xmm0 4fa: ff 4fb: c5 f9 57 c0 vxorpd xmm0,xmm0,xmm0 4ff: c5 fb 11 85 10 ff ff vmovsd QWORD PTR [rbp-0xf0],xmm0 506: ff 507: c5 fb 10 85 f0 fe ff vmovsd xmm0,QWORD PTR [rbp-0x110] 50e: ff 50f: c5 fb 10 8d 00 ff ff vmovsd xmm1,QWORD PTR [rbp-0x100] 516: ff 517: c5 f1 14 c8 vunpcklpd xmm1,xmm1,xmm0 51b: c5 fb 10 85 08 ff ff vmovsd xmm0,QWORD PTR [rbp-0xf8] 522: ff 523: c5 fb 10 95 10 ff ff vmovsd xmm2,QWORD PTR [rbp-0xf0] 52a: ff 52b: c5 e9 14 c0 vunpcklpd xmm0,xmm2,xmm0 52f: c4 e3 7d 18 c1 01 vinsertf128 ymm0,ymm0,xmm1,0x1 535: c5 fd 29 85 70 ff ff vmovapd YMMWORD PTR [rbp-0x90],ymm0 53c: ff 53d: c5 fd 28 85 50 ff ff vmovapd ymm0,YMMWORD PTR [rbp-0xb0] 544: ff 545: c5 fd 29 45 90 vmovapd YMMWORD PTR [rbp-0x70],ymm0 54a: c5 fd 28 85 70 ff ff vmovapd ymm0,YMMWORD PTR [rbp-0x90] 551: ff 552: c5 fd 29 45 b0 vmovapd YMMWORD PTR [rbp-0x50],ymm0 } extern __inline __m256d __attribute__((__gnu_inline__, __always_inline__, __artificial__)) _mm256_mul_pd (__m256d __A, __m256d __B) { return (__m256d) ((__v4df)__A * (__v4df)__B); 557: c5 fd 28 45 90 vmovapd ymm0,YMMWORD PTR [rbp-0x70] 55c: c5 fd 59 45 b0 vmulpd ymm0,ymm0,YMMWORD PTR [rbp-0x50] /* Compute the product between the two vectors */ __m256d result = _mm256_mul_pd(v1vector,v2vector); 561: c5 fd 29 85 30 ff ff vmovapd YMMWORD PTR [rbp-0xd0],ymm0 568: ff /* Sum the elements of the result vector */ double* f = (double*)&result ; 569: 48 8d 85 30 ff ff ff lea rax,[rbp-0xd0] 570: 48 89 85 f8 fe ff ff mov QWORD PTR [rbp-0x108],rax dp= f[0]+f[1]+f[2]+f[3] ; 577: 48 8b 85 f8 fe ff ff mov rax,QWORD PTR [rbp-0x108] 57e: c5 fb 10 08 vmovsd xmm1,QWORD PTR [rax] 582: 48 8b 85 f8 fe ff ff mov rax,QWORD PTR [rbp-0x108] 589: 48 83 c0 08 add rax,0x8 58d: c5 fb 10 00 vmovsd xmm0,QWORD PTR [rax] 591: c5 f3 58 c0 vaddsd xmm0,xmm1,xmm0 595: 48 8b 85 f8 fe ff ff mov rax,QWORD PTR [rbp-0x108] 59c: 48 83 c0 10 add rax,0x10 5a0: c5 fb 10 08 vmovsd xmm1,QWORD PTR [rax] 5a4: c5 fb 58 c1 vaddsd xmm0,xmm0,xmm1 5a8: 48 8b 85 f8 fe ff ff mov rax,QWORD PTR [rbp-0x108] 5af: 48 83 c0 18 add rax,0x18 5b3: c5 fb 10 08 vmovsd xmm1,QWORD PTR [rax] 5b7: c5 fb 58 c1 vaddsd xmm0,xmm0,xmm1 5bb: c5 fb 11 85 e0 fe ff vmovsd QWORD PTR [rbp-0x120],xmm0 5c2: ff return dp; 5c3: c5 fb 10 85 e0 fe ff vmovsd xmm0,QWORD PTR [rbp-0x120] 5ca: ff } 5cb: 48 8b 45 e8 mov rax,QWORD PTR [rbp-0x18] 5cf: 64 48 33 04 25 28 00 xor rax,QWORD PTR fs:0x28 5d6: 00 00 5d8: 74 05 je 5df <dot_product+0x1eb> 5da: e8 00 00 00 00 call 5df <dot_product+0x1eb> 5df: 48 81 c4 28 01 00 00 add rsp,0x128 5e6: 41 5a pop r10 5e8: 5d pop rbp 5e9: 49 8d 62 f8 lea rsp,[r10-0x8] 5ed: c3 ret ``` --- ### 方法四 : ### Software Pipelinling [**SoPiP**](http://www.cs.cmu.edu/afs/cs/academic/class/15745-s12/public/lectures/L19-Software-Pipelining-1up.pdf) [**DotPSoPip**](http://www.fic.udc.es/files/asignaturas/81STR/anexos/dsp6m09.pdf) ```clike= static inline double dot_product(const double *v1, const double *v2) { double dp = 0.0; for (int i = 0; i < 3; i++) dp+=v1[i]*v2[i]; return dp; } ``` 原始的Dot Prodcut碼。 --- ```clike= static inline double dot_product(const double *v1, const double *v2) { double dp = 0.0; double pro ; for (int i = 0; i < 3; i++) { pro = v1[i] * v2[i]; dp = dp + pro; } return dp; } ``` 嘗試Software Pipeline。 --- Software Pipeline Dot Product 的效果似乎不是很好甚至比原始作法差,也許是我沒有做成功。 ![](https://i.imgur.com/JFL1yMd.png) --- 試著Output assembly code 看看Software Pipeline 是否有成功 [**GCC Assembly Code**](http://stackoverflow.com/questions/1289881/using-gcc-to-produce-readable-assembly) ``` objdump -d -M intel -S raytracing.o > AssembOr.txt ``` --- 發現因為還要加作一個Pro的空間和多做加的動作,也許是因Software Pipeline的目的是增加Throughput而因為這個Forloop只跑三圈所以能夠增加的Throughput有限所以顯現不出效果。 ``` for (int i = 0; i < 3; i++) 409: c7 45 f4 00 00 00 00 mov DWORD PTR [rbp-0xc],0x0 410: eb 46 jmp 458 <dot_product+0x64> dp+=v1[i]*v2[i]; 412: 8b 45 f4 mov eax,DWORD PTR [rbp-0xc] 415: 48 98 cdqe 417: 48 8d 14 c5 00 00 00 lea rdx,[rax*8+0x0] 41e: 00 41f: 48 8b 45 e8 mov rax,QWORD PTR [rbp-0x18] 423: 48 01 d0 add rax,rdx 426: f2 0f 10 08 movsd xmm1,QWORD PTR [rax] 42a: 8b 45 f4 mov eax,DWORD PTR [rbp-0xc] 42d: 48 98 cdqe 42f: 48 8d 14 c5 00 00 00 lea rdx,[rax*8+0x0] 436: 00 437: 48 8b 45 e0 mov rax,QWORD PTR [rbp-0x20] 43b: 48 01 d0 add rax,rdx 43e: f2 0f 10 00 movsd xmm0,QWORD PTR [rax] 442: f2 0f 59 c1 mulsd xmm0,xmm1 446: f2 0f 10 4d f8 movsd xmm1,QWORD PTR [rbp-0x8] 44b: f2 0f 58 c1 addsd xmm0,xmm1 44f: f2 0f 11 45 f8 movsd QWORD PTR [rbp-0x8],xmm0 ``` 原始Code的Assembly ``` double pro ; for (int i = 0; i < 3; i++) 409: c7 45 ec 00 00 00 00 mov DWORD PTR [rbp-0x14],0x0 410: eb 4c jmp 45e <dot_product+0x6a> { pro = v1[i] * v2[i]; 412: 8b 45 ec mov eax,DWORD PTR [rbp-0x14] 415: 48 98 cdqe 417: 48 8d 14 c5 00 00 00 lea rdx,[rax*8+0x0] 41e: 00 41f: 48 8b 45 d8 mov rax,QWORD PTR [rbp-0x28] 423: 48 01 d0 add rax,rdx 426: f2 0f 10 08 movsd xmm1,QWORD PTR [rax] 42a: 8b 45 ec mov eax,DWORD PTR [rbp-0x14] 42d: 48 98 cdqe 42f: 48 8d 14 c5 00 00 00 lea rdx,[rax*8+0x0] 436: 00 437: 48 8b 45 d0 mov rax,QWORD PTR [rbp-0x30] 43b: 48 01 d0 add rax,rdx 43e: f2 0f 10 00 movsd xmm0,QWORD PTR [rax] 442: f2 0f 59 c1 mulsd xmm0,xmm1 446: f2 0f 11 45 f8 movsd QWORD PTR [rbp-0x8],xmm0 dp += pro; 44b: f2 0f 10 45 f0 movsd xmm0,QWORD PTR [rbp-0x10] 450: f2 0f 58 45 f8 addsd xmm0,QWORD PTR [rbp-0x8] 455: f2 0f 11 45 f0 movsd QWORD PTR [rbp-0x10],xmm0 ``` SoftWare Pipline的Assmbly 在閱讀**SoPiP**時我也看到那位講師是把Software Pipeline跟Loop Unrolling做比較而他說Software Pipeline 比較好是因為 Unrolling 的技巧還是要branch,但是目前我們dotProduct只有三圈所以可以完全Unroll也許因為這樣要做出比Urolling 更有效的Software Pipeline 大概不太可能。 --- ### 方法五 : ### 在作業說明中還有OpenMP還有POSIX Thread 還沒有嘗試,先試試OpenMP 完全不懂OpenMP,所以用我最常解決問題的方法[Youtube](https://www.youtube.com/watch?v=nE-xN4Bf8XI&list=PLLX-Q6B8xqZ8n8bwjGdzBJ25X2utwnoEG) ``` gcc -fopenmp filename.c ``` 要Compile OpenMP 只要Flag fopemp ``` export OMP_NUM_THREADS=4 ``` 設定線程的數量 --- 在看了部份Inetel Youtube 的課程前就現來試試看,在看之前的Callgraph 顯示約 94%的 Function Call都是ray_color所以因此試著在此Funcation做Parallel。

    Import from clipboard

    Paste your markdown or webpage here...

    Advanced permission required

    Your current role can only read. Ask the system administrator to acquire write and comment permission.

    This team is disabled

    Sorry, this team is disabled. You can't edit this note.

    This note is locked

    Sorry, only owner can edit this note.

    Reach the limit

    Sorry, you've reached the max length this note can be.
    Please reduce the content or divide it to more notes, thank you!

    Import from Gist

    Import from Snippet

    or

    Export to Snippet

    Are you sure?

    Do you really want to delete this note?
    All users will lose their connection.

    Create a note from template

    Create a note from template

    Oops...
    This template has been removed or transferred.
    Upgrade
    All
    • All
    • Team
    No template.

    Create a template

    Upgrade

    Delete template

    Do you really want to delete this template?
    Turn this template into a regular note and keep its content, versions, and comments.

    This page need refresh

    You have an incompatible client version.
    Refresh to update.
    New version available!
    See releases notes here
    Refresh to enjoy new features.
    Your user state has changed.
    Refresh to load new user state.

    Sign in

    Forgot password

    or

    By clicking below, you agree to our terms of service.

    Sign in via Facebook Sign in via Twitter Sign in via GitHub Sign in via Dropbox Sign in with Wallet
    Wallet ( )
    Connect another wallet

    New to HackMD? Sign up

    Help

    • English
    • 中文
    • Français
    • Deutsch
    • 日本語
    • Español
    • Català
    • Ελληνικά
    • Português
    • italiano
    • Türkçe
    • Русский
    • Nederlands
    • hrvatski jezik
    • język polski
    • Українська
    • हिन्दी
    • svenska
    • Esperanto
    • dansk

    Documents

    Help & Tutorial

    How to use Book mode

    Slide Example

    API Docs

    Edit in VSCode

    Install browser extension

    Contacts

    Feedback

    Discord

    Send us email

    Resources

    Releases

    Pricing

    Blog

    Policy

    Terms

    Privacy

    Cheatsheet

    Syntax Example Reference
    # Header Header 基本排版
    - Unordered List
    • Unordered List
    1. Ordered List
    1. Ordered List
    - [ ] Todo List
    • Todo List
    > Blockquote
    Blockquote
    **Bold font** Bold font
    *Italics font* Italics font
    ~~Strikethrough~~ Strikethrough
    19^th^ 19th
    H~2~O H2O
    ++Inserted text++ Inserted text
    ==Marked text== Marked text
    [link text](https:// "title") Link
    ![image alt](https:// "title") Image
    `Code` Code 在筆記中貼入程式碼
    ```javascript
    var i = 0;
    ```
    var i = 0;
    :smile: :smile: Emoji list
    {%youtube youtube_id %} Externals
    $L^aT_eX$ LaTeX
    :::info
    This is a alert area.
    :::

    This is a alert area.

    Versions and GitHub Sync
    Get Full History Access

    • Edit version name
    • Delete

    revision author avatar     named on  

    More Less

    Note content is identical to the latest version.
    Compare
      Choose a version
      No search result
      Version not found
    Sign in to link this note to GitHub
    Learn more
    This note is not linked with GitHub
     

    Feedback

    Submission failed, please try again

    Thanks for your support.

    On a scale of 0-10, how likely is it that you would recommend HackMD to your friends, family or business associates?

    Please give us some advice and help us improve HackMD.

     

    Thanks for your feedback

    Remove version name

    Do you want to remove this version name and description?

    Transfer ownership

    Transfer to
      Warning: is a public team. If you transfer note to this team, everyone on the web can find and read this note.

        Link with GitHub

        Please authorize HackMD on GitHub
        • Please sign in to GitHub and install the HackMD app on your GitHub repo.
        • HackMD links with GitHub through a GitHub App. You can choose which repo to install our App.
        Learn more  Sign in to GitHub

        Push the note to GitHub Push to GitHub Pull a file from GitHub

          Authorize again
         

        Choose which file to push to

        Select repo
        Refresh Authorize more repos
        Select branch
        Select file
        Select branch
        Choose version(s) to push
        • Save a new version and push
        • Choose from existing versions
        Include title and tags
        Available push count

        Pull from GitHub

         
        File from GitHub
        File from HackMD

        GitHub Link Settings

        File linked

        Linked by
        File path
        Last synced branch
        Available push count

        Danger Zone

        Unlink
        You will no longer receive notification when GitHub file changes after unlink.

        Syncing

        Push failed

        Push successfully