# 2016q3 Homework1 (raytracing) contributed by <`Quexint`> ## 報告 ### 大綱 - [作業須知](https://hackmd.io/s/B1W5AWza) - [原始專案](https://github.com/sysprog21/raytracing) - [程式專案](https://github.com/Quexint/raytracing) ### 環境 - OS: Arch Linux 4.7.4-1-ARCH - CPU: Intel(R) Core(TM)2 Quad CPU @ 2.40GHz - MEM: 4G - GCC: 6.2.1 ### 規格 給定一段高數值計算的程式專案,在無損答案精度下,求計算時間最短。 ### 實驗 - Loop Unrolling - dot_product - add_vector - substruct_vector - multiply_vector - multiply_vectors - 函式優化 - CBLAS on dot_product - Inverse Square Root on normalize - 平行化 - 平行化模型 - Jobs/Worker - Lock-Free Programming - OMP on Raytracing - OMP on Math-Toolkit - Pthread on Raytraing - AVX SIMD on Math-Toolkie - 編譯器 - Memory Advise - inline on Math-Toolkit - asm on Math-Toolkit ### 結果 ![](https://d17oy1vhnax1f7.cloudfront.net/items/3F1G3t040K1z0g1T0x2g/runtime.png?v=bb895ff7) ## 開發記錄 - 參考 - [Optimization of Computer Programs in C](http://leto.net/docs/C-optimization.php) ### `v0` 原始版本 - 執行時間 ```bash # make Execution time of raytracing() : 6.357352 sec ``` - 評測 ```bash 23.57 1.06 1.06 69646433 0.00 0.00 dot_product 11.56 1.58 0.52 56956357 0.00 0.00 subtract_vector 10.45 2.05 0.47 13861875 0.00 0.00 rayRectangularIntersection ``` ### `v1` 優化 `dot_product` - 使用 Loop Unrolling 優化 ``` return v1[0] * v2[0] + v1[1] * v2[1] + v1[2] * v2[2]; ``` - 執行時間 ```bash # make Execution time of raytracing() : 5.628290 sec ``` - 評測 ```bash 18.55 0.76 0.76 56956357 0.00 0.00 subtract_vector 13.67 1.32 0.56 10598450 0.00 0.00 normalize 10.98 1.77 0.45 69646433 0.00 0.00 dot_product 10.13 2.19 0.42 13861875 0.00 0.00 rayRectangularIntersection ``` ### `v2x` 用 CBLAS 優化 `dot_product` (X) - 參考 - [CBLAS Code](https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/src/cblas/cblas_ddot.c) - [BLAS Library](http://www.netlib.org/blas/) - 以前寫過一點數值的程式,有用過數值的函式庫CBLAS。 - BLAS for Pascal, CBLAS for C - CBLAS 有 LEVEL1 (向量操作), 2(矩陣-向量操作), 3(矩陣-矩陣操作) - `sudo pacman -S cblas` - `Makefile` ``` -lcblas ``` - `math-toolkit.h` ``` #include <cblas.h> #define dot_product(x, y) cblas_ddot(3, x, 1, y, 1) ``` - 執行時間 ``` Execution time of raytracing() : 5.377129 sec ``` - 沒有顯著的提升 ### `v3` 優化 `subtract_vector` - 使用 Loop Unrolling 優化 ``` out[0] = a[0] - b[0]; out[1] = a[1] - b[1]; out[2] = a[2] - b[2]; ``` - 執行時間 ```bash # make Execution time of raytracing() : 4.737710 sec ``` - 評測 ```bash 12.40 0.45 0.45 69646433 0.00 0.00 dot_product 11.71 0.87 0.42 56956357 0.00 0.00 subtract_vector 11.29 1.27 0.41 13861875 0.00 0.00 rayRectangularIntersection 10.59 1.65 0.38 10598450 0.00 0.00 normalize ``` ### `v4` 優化 `add_vector` - 使用 Loop Unrolling 優化 ``` out[0] = a[0] + b[0]; out[1] = a[1] + b[1]; out[2] = a[2] + b[2]; ``` - 執行時間 ```bash # make Execution time of raytracing() : 4.456836 sec ``` - 評測 ```bash 16.10 0.51 0.51 10598450 0.00 0.00 normalize 14.05 0.96 0.45 69646433 0.00 0.00 dot_product 10.73 1.30 0.34 13861875 0.00 0.00 rayRectangularIntersection 9.47 1.60 0.30 56956357 0.00 0.00 subtract_vector 7.42 1.83 0.24 31410180 0.00 0.00 multiply_vector 6.94 2.05 0.22 4620625 0.00 0.00 ray_hit_object 6.31 2.25 0.20 17821809 0.00 0.00 cross_product ``` ### `v5x` 優化 `normalize` (X) - 分析: 強化 sqrt(x) 或是 強化除法 - 參考 - [Is it possible to roll a significantly faster version of sqrt](http://stackoverflow.com/questions/2637700/is-it-possible-to-roll-a-significantly-faster-version-of-sqrt) - [Fast inverse square root](https://en.wikipedia.org/wiki/Fast_inverse_square_root) - [TIMING SQUARE ROOT](http://assemblyrequired.crashworks.org/timing-square-root/) #### Trial 1: Fast Inverse Sqare Root ``` static inline float Q_rsqrt( float number ) { long i; float x2, y; const float threehalfs = 1.5F; x2 = number * 0.5F; y = number; i = * ( long * ) &y; i = 0x5f3759df - ( i >> 1 ); y = * ( float * ) &i; y = y * ( threehalfs - ( x2 * y * y ) ); return y; } static inline void normalize(double *v) { double d = Q_rsqrt(v[0] * v[0] + v[1] * v[1] + v[2] * v[2]); assert(d != 0.0 && "Error calculating normal"); v[0] *= d; v[1] *= d; v[2] *= d; } ``` - 執行時間 ```bash # make Execution time of raytracing() : 2.345482 sec <- 超快 ``` - 精度崩了。 ![image alt](https://d17oy1vhnax1f7.cloudfront.net/items/240Y1l3b2d1S0h0L2F1j/out.png?v=130d6b86) #### Trial 2: 改成乘法 ``` static inline void normalize(double *v) { double d = 1.0/sqrt(v[0] * v[0] + v[1] * v[1] + v[2] * v[2]); assert(d != 0.0 && "Error calculating normal"); v[0] *= d; v[1] *= d; v[2] *= d; } ``` - 執行時間 ```bash # make Execution time of raytracing() : 4.386514 sec ``` #### Trial 3: Fast Inverse Sqare Root (double version) ``` static inline double InvSqrt(double number) { long long i; double x2, y; const double threehalfs = 1.5F; x2 = number * 0.5F; y = number; i = * ( long long * ) &y; i = 0x5fe6ec85e7de30da - ( i >> 1 ); y = * ( double * ) &i; y = y * ( threehalfs - ( x2 * y * y ) ); // 1st iteration y = y * ( threehalfs - ( x2 * y * y ) ); // 2nd iteration, this can be removed return y; } ``` - 執行時間 ```bash # make Execution time of raytracing() : 4.575374 sec ``` - 更慢,精度有準一點但還是不夠。 ![image alt](https://d17oy1vhnax1f7.cloudfront.net/items/3U1d1z2q0P162U362G2e/out2.png?v=757d5f62) ### `v6` 優化 `multiply_vectors` - 使用 Loop Unrolling 優化 ``` out[0] = a[0] * b[0]; out[1] = a[1] * b[1]; out[2] = a[2] * b[2]; ``` - 執行時間 ```bash # make Execution time of raytracing() : 4.356295 sec ``` - 評測 ```bash 13.03 0.44 0.44 31410180 0.00 0.00 multiply_vector 12.28 0.85 0.41 10598450 0.00 0.00 normalize 11.68 1.24 0.39 56956357 0.00 0.00 subtract_vector 11.09 1.61 0.37 13861875 0.00 0.00 rayRectangularIntersection 10.79 1.97 0.36 69646433 0.00 0.00 dot_product ``` ### `v7` 優化 `multiply_vector` - 使用 Loop Unrolling 優化 ``` out[0] = a[0] * b; out[1] = a[1] * b; out[2] = a[2] * b; ``` - 執行時間 ```bash # make Execution time of raytracing() : 3.968329 sec ``` - 評測 ```bash 13.21 0.40 0.40 10598450 0.00 0.00 normalize 12.88 0.79 0.39 13861875 0.00 0.00 rayRectangularIntersection 12.71 1.18 0.39 69646433 0.00 0.00 dot_product 10.73 1.50 0.33 13861875 0.00 0.00 raySphereIntersection ``` ### `v8` OpenMP 平行化 (X) - 參考 - https://computing.llnl.gov/tutorials/openMP/ - http://bisqwit.iki.fi/story/howto/openmp/ - `sudo pacman -S openmp` - `Makefile` ``` CFLAGS = \ -std=gnu99 -Wall -O0 -g -fopenmp LDFLAGS = \ -lm -lomp ``` - `raytracing.c` ``` int HxW = height * width; int i, j; #pragma omp parallel for for(int itr = 0; itr < HxW; itr++) { i = itr % width; j = itr / width; ``` - 執行時間 ```bash # make Execution time of raytracing() : 1.581772 sec ``` - 評測 ```bash 17.30 0.89 0.89 10703749 0.00 0.00 subtract_vector 11.53 1.48 0.59 2033242 0.00 0.00 normalize 11.04 2.04 0.57 13231699 0.00 0.00 dot_product 10.65 2.59 0.55 6265867 0.00 0.00 multiply_vector 8.31 3.01 0.43 2719594 0.00 0.00 rayRectangularIntersection ``` - 顏色崩了,待修 - 找不出原因,改用 pthread 改寫。 ### `v9` Pthread 平行化 - 參考 - [Pthread Library](http://www.cs.cmu.edu/afs/cs/academic/class/15492-f07/www/pthreads.html) - `Makefile` ``` LDFLAGS = \ -lm -lpthread ``` - `raytracing.c` ``` void *work(void *ptr) { entry *info = (entry *) ptr; int i, j; int HxW = (info->height) * (info->width); idx_stack stk; point3 d; int factor = sqrt(SAMPLES); for(int itr = (info->idx); itr < HxW; itr+=NUM_THREAD) { j = itr / (info->width); i = itr % (info->height); double r = 0, g = 0, b = 0; /* MSAA */ for (int s = 0; s < SAMPLES; s++) { idx_stack_init(&stk); rayConstruction(d, info->u, info->v, info->w, i * factor + s / factor, j * factor + s % factor, info->view, (info->width) * factor, (info->height) * factor); if (ray_color(info->view->vrp, 0.0, d, &stk, info->rectangulars, info->spheres, info->lights, info->object_color, MAX_REFLECTION_BOUNCES)) { r += (info->object_color[0]); g += (info->object_color[1]); b += (info->object_color[2]); } else { r += (info->background_color[0]); g += (info->background_color[1]); b += (info->background_color[2]); } info->pixels[((itr) * 3) + 0] = r * 255 / SAMPLES; info->pixels[((itr) * 3) + 1] = g * 255 / SAMPLES; info->pixels[((itr) * 3) + 2] = b * 255 / SAMPLES; } } return NULL; } /* @param background_color this is not ambient light */ void raytracing(uint8_t *pixels, color background_color, rectangular_node rectangulars, sphere_node spheres, light_node lights, const viewpoint *view, int width, int height) { point3 u, v, w; color object_color = { 0.0, 0.0, 0.0 }; /* calculate u, v, w */ calculateBasisVectors(u, v, w, view); pthread_t *thread_array = (pthread_t *) malloc (sizeof(pthread_t) * NUM_THREAD); entry *message = (entry*) malloc (sizeof(entry) * NUM_THREAD); for(int i = 0; i < NUM_THREAD; i++) { message[i].idx = i; message[i].pixels = pixels; multiply_vector(background_color, 1.0, message[i].background_color); message[i].rectangulars = rectangulars; message[i].spheres = spheres; message[i].lights = lights; message[i].view = view; message[i].width = width; message[i].height = height; multiply_vector(u, 1.0, message[i].u); multiply_vector(v, 1.0, message[i].v); multiply_vector(w, 1.0, message[i].w); multiply_vector(object_color, 1.0, message[i].object_color); pthread_create(&thread_array[i], NULL, work, (void*)&message[i]); pthread_join(thread_array[i], NULL); } } ``` - 執行時間 ```bash # make Execution time of raytracing() : 4.010901 sec ``` - 改用 job/worker? #### 分配工作 - 參考 - [Mutex Lock Code Examples](https://docs.oracle.com/cd/E19683-01/806-6867/sync-12/index.html) - `raytracing.c` ``` const int NUM_THREAD = 4, NUM_WORK = 100; pthread_mutex_t thread_mutex; int jobs; int max(int a, int b) { return a > b ? a : b; } void *work(void *ptr) { int S, E; while(1) { pthread_mutex_lock(&thread_mutex); if(jobs != 0) { E = jobs; jobs = max(jobs - NUM_WORK, 0); S = jobs; } else S = E = -1; pthread_mutex_unlock(&thread_mutex); if(S == -1) break; for(int itr = S; itr < E; itr++) ... } /* @param background_color this is not ambient light */ void raytracing(uint8_t *pixels, color background_color, rectangular_node rectangulars, sphere_node spheres, light_node lights, const viewpoint *view, int width, int height) { jobs = height * width; ``` - 執行時間 ``` Execution time of raytracing() : 3.977515 sec ``` - 怎麼可能? - 改成先設置好每個執行緒再啟動 ``` for(int i = 0; i < NUM_THREAD; i++) { pthread_create(&thread_array[i], NULL, work, (void*)&message[i]); } for(int i = 0; i < NUM_THREAD; i++) pthread_join(thread_array[i], NULL); ``` - 執行時間 ``` Execution time of raytracing() : 1.002260 sec Execution time of raytracing() : 0.306669 sec <- 如果再加上 -Ofast 超快 ``` - 評測 ``` 17.48 0.80 0.80 8437099 0.09 0.09 subtract_vector 12.23 1.36 0.56 1561304 0.36 0.36 normalize 11.80 1.90 0.54 10395837 0.05 0.05 dot_product 10.38 2.38 0.48 4785927 0.10 0.10 multiply_vector ``` ### `v10x` AVX SIMD 加速 (X) - 參考 - http://wiki.csie.ncku.edu.tw/embedded/2015q3h1ext - [Intrinsics for Intel(R) Advanced Vector Extensions](http://scc.ustc.edu.cn/zlsc/sugon/intel/compiler_c/main_cls/intref_cls/common/intref_bk_advectorext.htm) - 前情提要 - 測試環境 CPU 不支援 AVX ,所以換了一個環境測,以下倍率為估算值(`新環境_T_v10 / 新環境_T_v9`) - `Makefile` ``` CFLAGS = \ -std=gnu99 -Wall -O0 -g -mavx ``` - `math-toolkit.h` ``` #include <immintrin.h> static inline void normalize(double *v) { double d = sqrt(v[0] * v[0] + v[1] * v[1] + v[2] * v[2]); assert(d != 0.0 && "Error calculating normal"); double tmp[4] = {d, d, d, 0}; _mm256_store_pd(v, _mm256_div_pd(*(__m256d*)v, *(__m256d*)tmp)); } static inline void add_vector(const double *a, const double *b, double *out) { _mm256_store_pd(out, _mm256_add_pd(*(__m256d*)a, *(__m256d*)b)); } static inline void subtract_vector(const double *a, const double *b, double *out) { _mm256_store_pd(out, _mm256_sub_pd(*(__m256d*)a, *(__m256d*)b)); } static inline void multiply_vectors(const double *a, const double *b, double *out) { _mm256_store_pd(out, _mm256_mul_pd(*(__m256d*)a, *(__m256d*)b)); } static inline void multiply_vector(const double *a, double b, double *out) { double tmp[4] = {b, b, b, 0}; _mm256_store_pd(out, _mm256_mul_pd(*(__m256d*)a, *(__m256d*)tmp)); } ``` - 執行時間 - 變慢:`(0.667676/0.626422) * 1.002260 = 1.068265399` ### `v11x` 再戰 OpenMP (X) - 嘗試將 OMP 套用在 `math-toolkit.h` 的小型函式 - `math-toolkit.h` ``` #pragma omp simd for(int i=0; i < 3; i++) out[i] = a[i] + b[i]; ``` - 執行時間 ``` Execution time of raytracing() : 1.421563 sec ``` ### `v12` 強制開啟 `inline` - 參考 - [How do I force gcc to inline a function?](http://stackoverflow.com/questions/8381293/how-do-i-force-gcc-to-inline-a-function) - [force inline function in other translation unit](http://stackoverflow.com/questions/13228326/force-inline-function-in-other-translation-unit) - `math-toolkit.h` ``` __attribute__((always_inline)) void normalize(double *v) __attribute__((always_inline)) double length(const double *v) __attribute__((always_inline)) void add_vector(const double *a, const double *b, double *out) __attribute__((always_inline)) void subtract_vector(const double *a, const double *b, double *out) __attribute__((always_inline)) void multiply_vectors(const double *a, const double *b, double *out) __attribute__((always_inline)) void multiply_vector(const double *a, double b, double *out) __attribute__((always_inline)) void cross_product(const double *v1, const double *v2, double *out) __attribute__((always_inline)) double dot_product(const double *v1, const double *v2) __attribute__((always_inline)) void scalar_triple_product(const double *u, const double *v, const double *w, double *out) __attribute__((always_inline)) double scalar_triple(const double *u, const double *v, const double *w) ``` - 執行時間 ``` Execution time of raytracing() : 0.876538 sec ``` - 提高 7.252796798 倍 ### `v13x` Linux Kernel Hack (X) - 對於常用的 Memory,嘗試去加快存取的速度。 - 使用 `madvise` / `posix_madvise` 建議 Kernel 對於某塊記憶體如何存取。 - `main.c` ``` #include <sys/mman.h> madvise(pixels, sizeof(unsigned char) * ROWS * COLS * 3, MADV_RANDOM); ``` - 試過 `MADV_RANDOM`, `MADV_WILLNEED` 都沒有提高速度。 ### `v14x` Lock-Free Programming (X) - 參考: - [Built-in functions for atomic memory access](https://gcc.gnu.org/onlinedocs/gcc-4.4.5/gcc/Atomic-Builtins.html) - [How does Compare and Swap work?](http://stackoverflow.com/questions/22339466/how-does-compare-and-swap-work) - [Compare-and-swap](https://en.wikipedia.org/wiki/Compare-and-swap) - 試圖拿掉保護 Jobs 的互斥鎖。 - 採用 GCC 內建的原子操作 ``` while((S = jobs) && __sync_bool_compare_and_swap(&jobs, jobs, jobs - NUM_WORK)); ``` - 執行時間:約5秒多,慢快5倍… ### `v15` 用 `-Ofast` 及 `asm` 作弊 - 參考 - [GCC-Inline-Assembly-HOWTO](http://www.ibiblio.org/gferg/ldp/GCC-Inline-Assembly-HOWTO.html#s5) - [Extended Asm - Assembler Instructions with C Expression Operands](https://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html) - [x86 instruction listings](https://en.wikipedia.org/wiki/X86_instruction_listings) - [Using Inline Assembly in C/C++](http://www.codeproject.com/KB/cpp/edujini_inline_asm.aspx?display=Print) - [Brennan's Guide to Inline Assembly](http://www.delorie.com/djgpp/doc/brennan/brennan_att_inline_djgpp.html) - [how to use C variables in an assembly code in c++ while compiling in gcc?](http://stackoverflow.com/questions/8650729/how-to-use-c-variables-in-an-assembly-code-in-c-while-compiling-in-gcc) - [How to Use Inline Assembly Language in C Code](https://gcc.gnu.org/onlinedocs/gcc-6.2.0/gcc/Using-Assembly-Language-with-C.html#Using-Assembly-Language-with-C) - [Inline Assembly/Examples](http://wiki.osdev.org/Inline_Assembly/Examples) - - 由於開 `-Ofast` 可以將速度提高到 `0.300528sec`,所以可以分析一下是怎麼變快的。 - 用 `objdump -d raytracing | vim -` 對照有無開啟 `-Ofast` - 以 `add_vector` 為例,行數就差好幾倍: ``` 0000000000401135 <add_vector>: 401135: 55 push %rbp 401136: 48 89 e5 mov %rsp,%rbp 401139: 48 89 7d f8 mov %rdi,-0x8(%rbp) 40113d: 48 89 75 f0 mov %rsi,-0x10(%rbp) 401141: 48 89 55 e8 mov %rdx,-0x18(%rbp) 401145: 48 8b 45 f8 mov -0x8(%rbp),%rax 401149: f2 0f 10 08 movsd (%rax),%xmm1 40114d: 48 8b 45 f0 mov -0x10(%rbp),%rax 401151: f2 0f 10 00 movsd (%rax),%xmm0 401155: f2 0f 58 c1 addsd %xmm1,%xmm0 401159: 48 8b 45 e8 mov -0x18(%rbp),%rax 40115d: f2 0f 11 00 movsd %xmm0,(%rax) 401161: 48 8b 45 e8 mov -0x18(%rbp),%rax 401165: 48 83 c0 08 add $0x8,%rax 401169: 48 8b 55 f8 mov -0x8(%rbp),%rdx 40116d: 48 83 c2 08 add $0x8,%rdx 401171: f2 0f 10 0a movsd (%rdx),%xmm1 401175: 48 8b 55 f0 mov -0x10(%rbp),%rdx 401179: 48 83 c2 08 add $0x8,%rdx 40117d: f2 0f 10 02 movsd (%rdx),%xmm0 401181: f2 0f 58 c1 addsd %xmm1,%xmm0 401185: f2 0f 11 00 movsd %xmm0,(%rax) 401189: 48 8b 45 e8 mov -0x18(%rbp),%rax 40118d: 48 83 c0 10 add $0x10,%rax 401191: 48 8b 55 f8 mov -0x8(%rbp),%rdx 401195: 48 83 c2 10 add $0x10,%rdx 401199: f2 0f 10 0a movsd (%rdx),%xmm1 40119d: 48 8b 55 f0 mov -0x10(%rbp),%rdx 4011a1: 48 83 c2 10 add $0x10,%rdx 4011a5: f2 0f 10 02 movsd (%rdx),%xmm0 4011a9: f2 0f 58 c1 addsd %xmm1,%xmm0 4011ad: f2 0f 11 00 movsd %xmm0,(%rax) 4011b1: 90 nop 4011b2: 5d pop %rbp 4011b3: c3 retq ``` ``` 0000000000403350 <add_vector>: 403350: f2 0f 10 07 movsd (%rdi),%xmm0 403354: f2 0f 58 06 addsd (%rsi),%xmm0 403358: f2 0f 11 02 movsd %xmm0,(%rdx) 40335c: f2 0f 10 47 08 movsd 0x8(%rdi),%xmm0 403361: f2 0f 58 46 08 addsd 0x8(%rsi),%xmm0 403366: f2 0f 11 42 08 movsd %xmm0,0x8(%rdx) 40336b: f2 0f 10 47 10 movsd 0x10(%rdi),%xmm0 403370: f2 0f 58 46 10 addsd 0x10(%rsi),%xmm0 403375: f2 0f 11 42 10 movsd %xmm0,0x10(%rdx) 40337a: c3 retq 40337b: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1) ``` - 所以直接複製 `-Ofast` 的組合語言就可以加速。 - 寫組語有點痛苦,有些區塊有 jump 不能直接套。 - 學到 FPU 跟 CPU 的組語不能混用… ``` static void normalize(double *v) { const double a = 1.0; asm("movsd (%rdi),%xmm1;" "movsd 0x8(%rdi),%xmm2;" "movsd 0x10(%rdi),%xmm3;" "mulsd %xmm1,%xmm1;" "mulsd %xmm2,%xmm2;" "mulsd %xmm3,%xmm3;" "addsd %xmm2,%xmm1;" "addsd %xmm3,%xmm1;" "sqrtsd %xmm1,%xmm1;" "divsd %xmm1,%xmm0;" "movsd (%rdi),%xmm1;" "movsd 0x8(%rdi),%xmm2;" "movsd 0x10(%rdi),%xmm3;" "mulsd %xmm0,%xmm1;" "mulsd %xmm0,%xmm2;" "mulsd %xmm0,%xmm3;" "movsd %xmm1, (%rdi);" "movsd %xmm2, 0x8(%rdi);" "movsd %xmm3, 0x10(%rdi);" ); } static double length(const double *v) { asm("movsd (%rdi),%xmm1;" "movsd 0x8(%rdi),%xmm0;" "mulsd %xmm1,%xmm1;" "mulsd %xmm0,%xmm0;" "movsd 0x10(%rdi),%xmm2;" "mulsd %xmm2,%xmm2;" "addsd %xmm1,%xmm0;" "addsd %xmm2,%xmm0;" "sqrtsd %xmm0,%xmm0;"); } static void add_vector(const double *a, const double *b, double *out) { asm("movsd (%rdi),%xmm0;" "addsd (%rsi),%xmm0;" "movsd %xmm0,(%rdx);" "movsd 0x8(%rdi),%xmm0;" "addsd 0x8(%rsi),%xmm0;" "movsd %xmm0,0x8(%rdx);" "movsd 0x10(%rdi),%xmm0;" "addsd 0x10(%rsi),%xmm0;" "movsd %xmm0,0x10(%rdx);"); } static void subtract_vector(const double *a, const double *b, double *out) { asm("movsd (%rdi),%xmm0;" "subsd (%rsi),%xmm0;" "movsd %xmm0,(%rdx);" "movsd 0x8(%rdi),%xmm0;" "subsd 0x8(%rsi),%xmm0;" "movsd %xmm0,0x8(%rdx);" "movsd 0x10(%rdi),%xmm0;" "subsd 0x10(%rsi),%xmm0;" "movsd %xmm0,0x10(%rdx);"); } static void multiply_vectors(const double *a, const double *b, double *out) { asm("movsd (%rdi),%xmm0;" "mulsd (%rsi),%xmm0;" "movsd %xmm0,(%rdx);" "movsd 0x8(%rdi),%xmm0;" "mulsd 0x8(%rsi),%xmm0;" "movsd %xmm0,0x8(%rdx);" "movsd 0x10(%rdi),%xmm0;" "mulsd 0x10(%rsi),%xmm0;" "movsd %xmm0,0x10(%rdx);"); } static void multiply_vector(const double *a, double b, double *out) { asm("movsd (%rdi),%xmm1;" "mulsd %xmm0,%xmm1;" "movsd %xmm1,(%rsi);" "movsd 0x8(%rdi),%xmm1;" "mulsd %xmm0,%xmm1;" "movsd %xmm1,0x8(%rsi);" "mulsd 0x10(%rdi),%xmm0;" "movsd %xmm0,0x10(%rsi);"); } static void cross_product(const double *v1, const double *v2, double *out) { asm("movsd 0x8(%rdi),%xmm0;" "movsd 0x10(%rdi),%xmm1;" "mulsd 0x10(%rsi),%xmm0;" "mulsd 0x8(%rsi),%xmm1;" "subsd %xmm1,%xmm0;" "movsd %xmm0,(%rdx);" "movsd 0x10(%rdi),%xmm0;" "movsd (%rdi),%xmm1;" "mulsd (%rsi),%xmm0;" "mulsd 0x10(%rsi),%xmm1;" "subsd %xmm1,%xmm0;" "movsd %xmm0,0x8(%rdx);" "movsd (%rdi),%xmm0;" "movsd 0x8(%rdi),%xmm1;" "mulsd 0x8(%rsi),%xmm0;" "mulsd (%rsi),%xmm1;" "subsd %xmm1,%xmm0;" "movsd %xmm0,0x10(%rdx);"); } static double dot_product(const double *v1, const double *v2) { asm("movsd (%rdi),%xmm0;" "movsd 0x8(%rdi),%xmm1;" "mulsd (%rsi),%xmm0;" "mulsd 0x8(%rsi),%xmm1;" "addsd %xmm1,%xmm0;" "movsd 0x10(%rdi),%xmm1;" "mulsd 0x10(%rsi),%xmm1;" "addsd %xmm1,%xmm0;"); } static void scalar_triple_product(const double *u, const double *v, const double *w, double *out) { asm("movsd 0x8(%rsi),%xmm2;" "movsd 0x10(%rsi),%xmm0;" "mulsd 0x10(%rdx),%xmm2;" "mulsd 0x8(%rdx),%xmm0;" "subsd %xmm0,%xmm2;" "movsd %xmm2,(%rcx);" "movsd 0x10(%rsi),%xmm1;" "movsd (%rsi),%xmm0;" "mulsd (%rdx),%xmm1;" "mulsd 0x10(%rdx),%xmm0;" "subsd %xmm0,%xmm1;" "movsd %xmm1,0x8(%rcx);" "movsd (%rsi),%xmm0;" "movsd 0x8(%rsi),%xmm3;" "mulsd 0x8(%rdx),%xmm0;" "mulsd (%rdx),%xmm3;" "subsd %xmm3,%xmm0;" "movsd %xmm0,0x10(%rcx);" "mulsd (%rdi),%xmm2;" "movsd %xmm2,(%rcx);" "mulsd 0x8(%rdi),%xmm1;" "movsd %xmm1,0x8(%rcx);" "mulsd 0x10(%rdi),%xmm0;" "movsd %xmm0,0x10(%rcx);"); } static double scalar_triple(const double *u, const double *v, const double *w) { asm("movsd 0x8(%rdx),%xmm3;" "movsd 0x10(%rdx),%xmm2;" "movsd 0x10(%rdi),%xmm4;" "movapd %xmm3,%xmm0;" "movsd 0x8(%rdi),%xmm1;" "movsd (%rdi),%xmm5;" "movapd %xmm2,%xmm7;" "movsd (%rdx),%xmm6;" "mulsd %xmm1,%xmm7;" "mulsd %xmm4,%xmm0;" "mulsd %xmm5,%xmm2;" "mulsd %xmm6,%xmm4;" "mulsd %xmm6,%xmm1;" "subsd %xmm7,%xmm0;" "mulsd %xmm5,%xmm3;" "subsd %xmm4,%xmm2;" "mulsd (%rsi),%xmm0;" "subsd %xmm3,%xmm1;" "mulsd 0x8(%rsi),%xmm2;" "mulsd 0x10(%rsi),%xmm1;" "addsd %xmm2,%xmm0;" "addsd %xmm1,%xmm0;"); } ``` - 執行時間 (9.132853611倍) ``` Execution time of raytracing() : 0.696097 sec ``` - 若要打敗 `-Ofast` 的 0.3 秒,可能還要重構 `raytracing.c` 了。