# [software-pipelining](https://hackmd.io/s/ry7eqDEC)

[github](https://github.com/diana0651/prefetcher)

contributed by <`Diana Ho`>

###### tags: `d0651` `sys`

## Case study

### Goals

- [ ] Learn computer architecture and verify what was learned through experiments
- [ ] Understand how prefetching affects the cache, and design experiments that clarify the related questions
- [ ] Read the paper and reflect on it

### Assignment requirements

- [ ] Read the paper mentioned in [在計算機裡頭實踐演算法](/s/HyKtIPN0): "[When Prefetching Works, When It Doesn’t, and Why](http://www.cc.gatech.edu/~hyesoon/lee_taco12.pdf)" (read the paper first; diving straight into the code only leads to confusion), then build and run [prefetcher](https://github.com/embedded2015/prefetcher) on Linux/x86_64 (a 64-bit system is required, and it must not run inside a virtual machine)
    * Explain the performance differences among `naive_transpose`, `sse_transpose`, and `sse_prefetch_transpose`, and how the prefetcher affects the cache
- [ ] Fork [prefetcher](https://github.com/embedded2015/prefetcher) on GitHub and try to improve performance further with AVX
    * Modify the `Makefile` to produce separate executables for `naive_transpose`, `sse_transpose`, and `sse_prefetch_transpose` (following the approach used in [phonebook](s/S1RVdgza))
    * Analyze cache misses/hits with perf
    * Learn the raw counter commands of `perf stat`
    * Refer to [Performance of SSE and AVX Instruction Sets](http://arxiv.org/pdf/1211.0820.pdf) and rewrite the code with SSE/AVX intrinsics

---

## Prefetching

>[Data Prefetch Mechanisms](http://www.cs.ucy.ac.cy/courses/EPL605/Fall2014Files/DataPrefetchSurvey.pdf)
>[Caches and Prefetching](http://web.cecs.pdx.edu/~alaa/courses/ece587/spring2012/notes/prefetching.pdf)
>[Prefetching](http://www.cc.gatech.edu/~hyesoon/fall11/lec_pref.pdf)

Loading data from memory into the cache before the CPU needs it shortens the time the CPU spends waiting, which improves throughput.

- Timing of the load into the cache
    - Not too late (the CPU has already passed the point where it needed the data)
    - Not too early (the line may be evicted from the cache before it is used)
- The prefetch intrinsic takes a parameter that selects which cache level the data is loaded into

### Background

* SW prefetcher:
    * performance-oriented algorithms in the compiler
    * intrinsics, e.g. `_mm_prefetch()` in SSE
* HW prefetcher: the prefetcher inside the CPU
    * uses past access history as the basis for prefetching
    * the [GHB](http://www.eecg.toronto.edu/~steffan/carg/readings/ghb.pdf) (Global History Buffer) mentioned in the paper
* Terminology
    * Prefetch Hit: Prefetched line that was hit in the cache before being replaced (miss avoided)
    * Prefetch Miss: Prefetched line that was replaced before being accessed
    * Prefetch rate: Prefetches per instruction (or 1000 inst.)
    * Accuracy: Percentage of prefetch hits to all prefetches
    * Coverage: Percentage of misses avoided due to prefetching
        * 100 x (Prefetch Hits / (Prefetch Hits + Cache Misses))

>>[Reference](https://embedded2015.hackpad.com/-Homework-7-8-XZe3c94XjUh)
>>[Reference](https://hackmd.io/OwZgHAhmBMYgtBADAThfALCkAzeAjWAE3gFYMBTAYxwEYQN8rokg?both)
>>[Reference](https://hackmd.io/CYFgzAxgHFEOwFoJhI8BWARgqcAMAbAugIYgkBmATAZiNZkA?both)

## Paper reading [When Prefetching Works, When It Doesn’t, and Why](http://www.cc.gatech.edu/~hyesoon/lee_taco12.pdf)

### 3. POSITIVE AND NEGATIVE IMPACTS OF SOFTWARE PREFETCHING

* benefits of SW prefetching
    * large number of streams
    * short streams
    * irregular memory access
    * cache locality hint
    * loop bounds
* negative impacts of SW prefetching
    * Increased Instruction Count
    * Static Insertion
    * Code Structure Change
* synergistic effect of SW & HW prefetching
    * Handling Multiple Streams
    * Positive Training
* antagonistic effect of SW & HW prefetching
    * Negative Training
    * Harmful Software Prefetching

---

## Running the program

```clike=
$ ./main
 0  1  2  3
 4  5  6  7
 8  9 10 11
12 13 14 15

 0  4  8 12
 1  5  9 13
 2  6 10 14
 3  7 11 15
sse prefetch: 192603 us
sse: 215824 us
naive: 365479 us
```

- Analyzing cache misses with perf

```clike=
$ perf stat -r 50 -e cache-misses,cache-references,L1-dcache-load-misses,L1-dcache-loads,L1-dcache-stores,L1-icache-load-misses ./main

 Performance counter stats for './main' (50 runs):

        27,985,483      cache-misses              #   88.670 % of all cache refs    (33.29%)
        31,806,730      cache-references                                            (50.24%)
        34,757,822      L1-dcache-load-misses     #    4.34% of all L1-dcache hits  (66.90%)
       806,498,136      L1-dcache-loads                                             (66.16%)
       352,545,104      L1-dcache-stores                                            (64.64%)
           377,450      L1-icache-load-misses                                       (33.11%)

       0.731182840 seconds time elapsed                                          ( +-  3.65% )
```

### Modifying the Makefile

```clike=
$ make cache-test
perf stat -e cache-misses,cache-references,instructions,cycles ./naive_transpose
 0  1  2  3
 4  5  6  7
 8  9 10 11
12 13 14 15

 0  4  8 12
 1  5  9 13
 2  6 10 14
 3  7 11 15
naive: 310485 us

 Performance counter stats for './naive_transpose':

        18,959,975      cache-misses              #   93.414 % of all cache refs
        20,296,767      cache-references
     1,572,649,544      instructions              #    0.89  insns per cycle
     1,761,874,413      cycles

       0.511842859 seconds time elapsed

perf stat -e cache-misses,cache-references,instructions,cycles ./sse_transpose
 0  1  2  3
 4  5  6  7
 8  9 10 11
12 13 14 15

 0  4  8 12
 1  5  9 13
 2  6 10 14
 3  7 11 15
sse: 160067 us

 Performance counter stats for './sse_transpose':

         6,657,239      cache-misses              #   84.130 % of all cache refs
         7,913,069      cache-references
     1,421,957,448      instructions              #    1.13  insns per cycle
     1,253,144,039      cycles

       0.333203789 seconds time elapsed

perf stat -e cache-misses,cache-references,instructions,cycles ./sse_prefetch_transpose
 0  1  2  3
 4  5  6  7
 8  9 10 11
12 13 14 15

 0  4  8 12
 1  5  9 13
 2  6 10 14
 3  7 11 15
sse prefetch: 85866 us

 Performance counter stats for './sse_prefetch_transpose':

         6,592,423      cache-misses              #   83.190 % of all cache refs
         7,924,577      cache-references
     1,407,620,222      instructions              #    1.54  insns per cycle
       914,875,974      cycles

       0.283419155 seconds time elapsed
```

## Performance differences

### Matrix transpose

#### TRANSPOSE_IMPL

The matrices are stored in row-major order, so walking along a row touches contiguous memory while walking along a column touches non-contiguous memory. A transpose has to do one of each, which is why it produces such a high share of cache misses (93.4% of cache references in the `naive_transpose` run above).

### naive_transpose

The most intuitive way to transpose a matrix: start from the top-left element, compute where each element ends up after the transpose, and store it into the new matrix.

```clike=
void naive_transpose(int *src, int *dst, int w, int h)
{
    for(int x = 0; x < w; x++){
        for(int y = 0; y < h; y++){
            *(dst + x*h + y) = *(src + y*w + x);
        }
    }
}
```

### sse_transpose

This version uses the [SIMD](https://en.wikipedia.org/wiki/SIMD) support of Intel processors. Each int element is 32 bits, so one row of a 4x4 block packs exactly into a 128-bit register, and swapping the high and low parts then produces the transpose. By comparison, `naive_transpose` runs many more iterations and jumps, which raises the chance of cache misses, so `sse_transpose` is faster.

#### Performance improvements

* Four values are loaded into an SSE register at once, so one instruction handles four values; this is faster than four instructions for four values
* loop unrolling:
    * the assembly for the loop body executes fewer times
    * branch mispredictions become less likely

> [Programming trivia: 4x4 integer matrix transpose in SSE2](https://www.randombit.net/bitbashing/2009/10/08/integer_matrix_transpose_in_sse2.html)

```clike=
void sse_transpose(int *src, int *dst, int w, int h)
{
    for (int x = 0; x < w; x += 4) {
        for (int y = 0; y < h; y += 4) {
            __m128i I0 = _mm_loadu_si128((__m128i *)(src + (y + 0) * w + x));
            __m128i I1 = _mm_loadu_si128((__m128i *)(src + (y + 1) * w + x));
            __m128i I2 = _mm_loadu_si128((__m128i *)(src + (y + 2) * w + x));
            __m128i I3 = _mm_loadu_si128((__m128i *)(src + (y + 3) * w + x));
            __m128i T0 = _mm_unpacklo_epi32(I0, I1);
            __m128i T1 = _mm_unpacklo_epi32(I2, I3);
            __m128i T2 = _mm_unpackhi_epi32(I0, I1);
            __m128i T3 = _mm_unpackhi_epi32(I2, I3);
            I0 = _mm_unpacklo_epi64(T0, T1);
            I1 = _mm_unpackhi_epi64(T0, T1);
            I2 = _mm_unpacklo_epi64(T2, T3);
            I3 = _mm_unpackhi_epi64(T2, T3);
            _mm_storeu_si128((__m128i *)(dst + ((x + 0) * h) + y), I0);
            _mm_storeu_si128((__m128i *)(dst + ((x + 1) * h) + y), I1);
            _mm_storeu_si128((__m128i *)(dst + ((x + 2) * h) + y), I2);
            _mm_storeu_si128((__m128i *)(dst + ((x + 3) * h) + y), I3);
        }
    }
}
```

:::info
Use the [Intel Intrinsics Guide](https://software.intel.com/sites/landingpage/IntrinsicsGuide/) to look up what each intrinsic does internally
- `__m128i _mm_unpacklo_epi32 (__m128i a, __m128i b)` interleaves the 32-bit elements from the low halves of `a` and `b` into the result
- the other unpack intrinsics follow the same pattern
:::

:::danger
#### Understanding the implementation

>>[Reference](https://hackmd.io/KYQwbMBMkAxgtAVgOwE5LwCwmAY3qogIwBm8kywAzIhCCDLrkA==?both)

1. First, the layout of the loaded data
```
    0      32      64      96     128
    +-------+-------+-------+-------+
I0  |  I00  |  I01  |  I02  |  I03  |
    +-------+-------+-------+-------+
I1  |  I10  |  I11  |  I12  |  I13  |
    +-------+-------+-------+-------+
I2  |  I20  |  I21  |  I22  |  I23  |
    +-------+-------+-------+-------+
I3  |  I30  |  I31  |  I32  |  I33  |
    +-------+-------+-------+-------+
```
2. After `_mm_unpacklo_epi32()` and `_mm_unpackhi_epi32()`
```
    0      32      64      96     128
    +-------+-------+-------+-------+
T0  |  I00  |  I10  |  I01  |  I11  |
    +-------+-------+-------+-------+
T1  |  I20  |  I30  |  I21  |  I31  |
    +-------+-------+-------+-------+
T2  |  I02  |  I12  |  I03  |  I13  |
    +-------+-------+-------+-------+
T3  |  I22  |  I32  |  I23  |  I33  |
    +-------+-------+-------+-------+
```
3. `_mm_unpacklo_epi64()` and `_mm_unpackhi_epi64()` complete the transpose
```
    0      32      64      96     128
    +-------+-------+-------+-------+
I0  |  I00  |  I10  |  I20  |  I30  |
    +-------+-------+-------+-------+
I1  |  I01  |  I11  |  I21  |  I31  |
    +-------+-------+-------+-------+
I2  |  I02  |  I12  |  I22  |  I32  |
    +-------+-------+-------+-------+
I3  |  I03  |  I13  |  I23  |  I33  |
    +-------+-------+-------+-------+
```
:::

### sse_prefetch_transpose

> Loads one cache line of data from address p to a location closer to the processor.

> Fetch the line of data from memory that contains address p to a location in the cache hierarchy specified by the locality hint i.

Compared with `sse_transpose`, this version issues four extra [`_mm_prefetch`](https://msdn.microsoft.com/en-us/library/84szxsww(v=vs.90).aspx) instructions per block. Prefetching moves data that is about to be used closer to the processor, which lowers the cost of fetching it from memory.

```clike=
_mm_prefetch(char * p , int i )
// fetches one line of data
// p : the address to read from
// i : the type of prefetch operation: the constants _MM_HINT_T0, _MM_HINT_T1, _MM_HINT_T2, and _MM_HINT_NTA
```

#### Evidence from the paper [When Prefetching Works, When It Doesn’t, and Why](http://www.cc.gatech.edu/~hyesoon/lee_taco12.pdf)

- benefits of SW prefetching
    - Short Streams (16\*4 byte)
    - Cache Locality Hint (data is prefetched into the cache ahead of use; the original note says L1, although the code below passes `_MM_HINT_T1`, which generally targets L2)
    - Loop Bounds (the 4x4 block is unrolled)

#### Performance improvements

- Fewer cache-misses
- Shorter execution time (less time spent fetching from memory and stalling on cache-misses)

```clike=
void sse_prefetch_transpose(int *src, int *dst, int w, int h)
{
    for (int x = 0; x < w; x += 4) {
        for (int y = 0; y < h; y += 4) {
#define PFDIST  8
            _mm_prefetch(src+(y + PFDIST + 0) *w + x, _MM_HINT_T1);
            _mm_prefetch(src+(y + PFDIST + 1) *w + x, _MM_HINT_T1);
            _mm_prefetch(src+(y + PFDIST + 2) *w + x, _MM_HINT_T1);
            _mm_prefetch(src+(y + PFDIST + 3) *w + x, _MM_HINT_T1);

            __m128i I0 = _mm_loadu_si128 ((__m128i *)(src + (y + 0) * w + x));
            __m128i I1 = _mm_loadu_si128 ((__m128i *)(src + (y + 1) * w + x));
            __m128i I2 = _mm_loadu_si128 ((__m128i *)(src + (y + 2) * w + x));
            __m128i I3 = _mm_loadu_si128 ((__m128i *)(src + (y + 3) * w + x));
            __m128i T0 = _mm_unpacklo_epi32(I0, I1);
            __m128i T1 = _mm_unpacklo_epi32(I2, I3);
            __m128i T2 = _mm_unpackhi_epi32(I0, I1);
            __m128i T3 = _mm_unpackhi_epi32(I2, I3);
            I0 = _mm_unpacklo_epi64(T0, T1);
            I1 = _mm_unpackhi_epi64(T0, T1);
            I2 = _mm_unpacklo_epi64(T2, T3);
            I3 = _mm_unpackhi_epi64(T2, T3);
            _mm_storeu_si128((__m128i *)(dst + ((x + 0) * h) + y), I0);
            _mm_storeu_si128((__m128i *)(dst + ((x + 1) * h) + y), I1);
            _mm_storeu_si128((__m128i *)(dst + ((x + 2) * h) + y), I2);
            _mm_storeu_si128((__m128i *)(dst + ((x + 3) * h) + y), I3);
        }
    }
}
```

### [SSE/AVX](http://arxiv.org/pdf/1211.0820.pdf)

#### Rewriting the prefetcher with AVX

The AVX version extends the idea of the SSE version.

>[Introduction to Intel® Advanced Vector Extensions](https://software.intel.com/en-us/articles/introduction-to-intel-advanced-vector-extensions)

AVX registers are 256 bits wide, so each one holds eight 32-bit ints, and the loops advance by 8.

1. Load the elements of 8 rows into registers, one row per register
2. `_mm256_unpacklo/hi_epi32` takes two 256-bit values and interleaves the 32-bit elements from the low/high half of each 128-bit lane; `_mm256_unpacklo_epi64` works the same way on 64-bit elements. These 256-bit integer intrinsics belong to AVX2.
    - __m256i A = [ A0, A1, A2, A3, A4, A5, A6, A7 ];
    - __m256i B = [ B0, B1, B2, B3, B4, B5, B6, B7 ];
    - __m256i C = \_mm256_unpacklo_epi32(A, B) = [ A0, B0, A1, B1, A4, B4, A5, B5 ];

```clike=
void avx_transpose(int *src, int *dst, int w, int h)
{
    for(int x=0; x < w ; x+= 8) {
        for(int y=0; y < h ; y+=8) {
            __m256i I0 = _mm256_loadu_si256((__m256i *)(src + (y + 0) * w + x));
            __m256i I1 = _mm256_loadu_si256((__m256i *)(src + (y + 1) * w + x));
            __m256i I2 = _mm256_loadu_si256((__m256i *)(src + (y + 2) * w + x));
            __m256i I3 = _mm256_loadu_si256((__m256i *)(src + (y + 3) * w + x));
            __m256i I4 = _mm256_loadu_si256((__m256i *)(src + (y + 4) * w + x));
            __m256i I5 = _mm256_loadu_si256((__m256i *)(src + (y + 5) * w + x));
            __m256i I6 = _mm256_loadu_si256((__m256i *)(src + (y + 6) * w + x));
            __m256i I7 = _mm256_loadu_si256((__m256i *)(src + (y + 7) * w + x));

            __m256i T0 = _mm256_unpacklo_epi32(I0 , I1);
            __m256i T1 = _mm256_unpackhi_epi32(I0 , I1);
            __m256i T2 = _mm256_unpacklo_epi32(I2 , I3);
            __m256i T3 = _mm256_unpackhi_epi32(I2 , I3);
            __m256i T4 = _mm256_unpacklo_epi32(I4 , I5);
            __m256i T5 = _mm256_unpackhi_epi32(I4 , I5);
            __m256i T6 = _mm256_unpacklo_epi32(I6 , I7);
            __m256i T7 = _mm256_unpackhi_epi32(I6 , I7);

            I0 = _mm256_unpacklo_epi64(T0, T2);
            I1 = _mm256_unpackhi_epi64(T0, T2);
            I2 = _mm256_unpacklo_epi64(T1, T3);
            I3 = _mm256_unpackhi_epi64(T1, T3);
            I4 = _mm256_unpacklo_epi64(T4, T6);
            I5 = _mm256_unpackhi_epi64(T4, T6);
            I6 = _mm256_unpacklo_epi64(T5, T7);
            I7 = _mm256_unpackhi_epi64(T5, T7);

            T0 = _mm256_permute2f128_si256(I0 , I4 , 0x20);
            T1 = _mm256_permute2f128_si256(I1 , I5 , 0x20);
            T2 = _mm256_permute2f128_si256(I2 , I6 , 0x20);
            T3 = _mm256_permute2f128_si256(I3 , I7 , 0x20);
            T4 = _mm256_permute2f128_si256(I0 , I4 , 0x31);
            T5 = _mm256_permute2f128_si256(I1 , I5 , 0x31);
            T6 = _mm256_permute2f128_si256(I2 , I6 , 0x31);
            T7 = _mm256_permute2f128_si256(I3 , I7 , 0x31);

            _mm256_storeu_si256((__m256i *) (dst + (x + 0) * h + y) , T0);
            _mm256_storeu_si256((__m256i *) (dst + (x + 1) * h + y) , T1);
            _mm256_storeu_si256((__m256i *) (dst + (x + 2) * h + y) , T2);
            _mm256_storeu_si256((__m256i *) (dst + (x + 3) * h + y) , T3);
            _mm256_storeu_si256((__m256i *) (dst + (x + 4) * h + y) , T4);
            _mm256_storeu_si256((__m256i *) (dst + (x + 5) * h + y) , T5);
            _mm256_storeu_si256((__m256i *) (dst + (x + 6) * h + y) , T6);
            _mm256_storeu_si256((__m256i *) (dst + (x + 7) * h + y) , T7);
        }
    }
}
```

```clike=
void avx_prefetch_transpose(int *src, int *dst, int w, int h)
{
    for (int x = 0; x < w; x += 8) {
        for (int y = 0; y < h; y += 8) {
#define AVX_PFDIST  8
            _mm_prefetch(src + (y + AVX_PFDIST + 0) * w + x, _MM_HINT_T1);
            _mm_prefetch(src + (y + AVX_PFDIST + 1) * w + x, _MM_HINT_T1);
            _mm_prefetch(src + (y + AVX_PFDIST + 2) * w + x, _MM_HINT_T1);
            _mm_prefetch(src + (y + AVX_PFDIST + 3) * w + x, _MM_HINT_T1);
            _mm_prefetch(src + (y + AVX_PFDIST + 4) * w + x, _MM_HINT_T1);
            _mm_prefetch(src + (y + AVX_PFDIST + 5) * w + x, _MM_HINT_T1);
            _mm_prefetch(src + (y + AVX_PFDIST + 6) * w + x, _MM_HINT_T1);
            _mm_prefetch(src + (y + AVX_PFDIST + 7) * w + x, _MM_HINT_T1);

            __m256i I0 = _mm256_loadu_si256((__m256i *)(src + (y + 0) * w + x));
            __m256i I1 = _mm256_loadu_si256((__m256i *)(src + (y + 1) * w + x));
            __m256i I2 = _mm256_loadu_si256((__m256i *)(src + (y + 2) * w + x));
            __m256i I3 = _mm256_loadu_si256((__m256i *)(src + (y + 3) * w + x));
            __m256i I4 = _mm256_loadu_si256((__m256i *)(src + (y + 4) * w + x));
            __m256i I5 = _mm256_loadu_si256((__m256i *)(src + (y + 5) * w + x));
            __m256i I6 = _mm256_loadu_si256((__m256i *)(src + (y + 6) * w + x));
            __m256i I7 = _mm256_loadu_si256((__m256i *)(src + (y + 7) * w + x));

            __m256i T0 = _mm256_unpacklo_epi32(I0, I1);
            __m256i T1 = _mm256_unpacklo_epi32(I2, I3);
            __m256i T2 = _mm256_unpacklo_epi32(I4, I5);
            __m256i T3 = _mm256_unpacklo_epi32(I6, I7);
            __m256i T4 = _mm256_unpackhi_epi32(I0, I1);
            __m256i T5 = _mm256_unpackhi_epi32(I2, I3);
            __m256i T6 = _mm256_unpackhi_epi32(I4, I5);
            __m256i T7 = _mm256_unpackhi_epi32(I6, I7);

            I0 = _mm256_unpacklo_epi64(T0, T1);
            I1 = _mm256_unpackhi_epi64(T0, T1);
            I2 = _mm256_unpacklo_epi64(T2, T3);
            I3 = _mm256_unpackhi_epi64(T2, T3);
            I4 = _mm256_unpacklo_epi64(T4, T5);
            I5 = _mm256_unpackhi_epi64(T4, T5);
            I6 = _mm256_unpacklo_epi64(T6, T7);
            I7 = _mm256_unpackhi_epi64(T6, T7);

            T0 = _mm256_permute2x128_si256(I0, I2, 0x20);
            T1 = _mm256_permute2x128_si256(I1, I3, 0x20);
            T2 = _mm256_permute2x128_si256(I4, I6, 0x20);
            T3 = _mm256_permute2x128_si256(I5, I7, 0x20);
            T4 = _mm256_permute2x128_si256(I0, I2, 0x31);
            T5 = _mm256_permute2x128_si256(I1, I3, 0x31);
            T6 = _mm256_permute2x128_si256(I4, I6, 0x31);
            T7 = _mm256_permute2x128_si256(I5, I7, 0x31);

            _mm256_storeu_si256((__m256i *)(dst + ((x + 0) * h) + y), T0);
            _mm256_storeu_si256((__m256i *)(dst + ((x + 1) * h) + y), T1);
            _mm256_storeu_si256((__m256i *)(dst + ((x + 2) * h) + y), T2);
            _mm256_storeu_si256((__m256i *)(dst + ((x + 3) * h) + y), T3);
            _mm256_storeu_si256((__m256i *)(dst + ((x + 4) * h) + y), T4);
            _mm256_storeu_si256((__m256i *)(dst + ((x + 5) * h) + y), T5);
            _mm256_storeu_si256((__m256i *)(dst + ((x + 6) * h) + y), T6);
            _mm256_storeu_si256((__m256i *)(dst + ((x + 7) * h) + y), T7);
        }
    }
}
```

:::danger
Understanding the implementation
>>[Reference](https://hackmd.io/KYQwbMBMkAxgtAVgOwE5LwCwmAY3qogIwBm8kywAzIhCCDLrkA==?both)
:::

#### Implementation notes

- If the Makefile tries to build both sse_transpose and avx_transpose, an error appears as soon as the object file is generated
- If the Makefile builds only avx_transpose and not sse_transpose, no error appears while generating the object file, but the resulting binary aborts with `不合法的命令 (core dumped)` (Illegal instruction) when `make` runs it
- ==No solution found yet==

```clike=
$./avx_transpose
make: *** [test] 不合法的命令 (core dumped)

$ make cache-test
In file included from /usr/lib/gcc/x86_64-linux-gnu/4.9/include/immintrin.h:41:0,
                 from main.c:9:
/usr/lib/gcc/x86_64-linux-gnu/4.9/include/avxintrin.h:896:1: error: inlining failed in call to always_inline ‘_mm256_storeu_si256’: target specific option mismatch
 _mm256_storeu_si256 (__m256i *__P, __m256i __A)
 ^
In file included from main.c:18:0:
impl.c:115:13: error: called from here
             _mm256_storeu_si256((__m256i *) (dst + (x + 7) * h + y) , T7);
             ^
make: *** [main] Error 1
```

- cache-miss
- execution time

>>[Reference notes](https://embedded2015.hackpad.com/Week8--VGN4PI1cUxh)
>>[Reference notes](https://embedded2015.hackpad.com/-Homework-7-8-XZe3c94XjUh)

---

<style>
h2.part {color: red;}
h3.part {color: green;}
h4.part {color: blue;}
h5.part {color: black;}
</style>