David Tang

@davidzwei

Joined on Apr 16, 2020

  • Q1 How do you control the number of MPI processes on each node? Set slots in the hosts file: append slots=n after each hostname, where n is the number of processes that node may run. For instance: pp2 slots=4 pp3 slots=4 Which functions do you use for retrieving the rank of an MPI process and the total number of processes? Retrieve the rank of an MPI process:
  • Q1 Run ./myexp -s 10000 and sweep the vector width from 2, 4, 8, to 16. Record the resulting vector utilization. vector width = 2 CLAMPED EXPONENT (required) Results matched with answer! ****************** Printing Vector Unit Statistics ******************* Vector Width: 2 Total Vector Instructions: 162728 Vector Utilization: 84.8%
  • Q1 In your write-up, produce a graph of speedup compared to the reference sequential implementation as a function of the number of threads used FOR VIEW 1. ❯ ./mandelbrot -t 1 [mandelbrot serial]: [668.115] ms Wrote image file mandelbrot-serial.ppm [mandelbrot thread]: [668.263] ms Wrote image file mandelbrot-thread.ppm (1.00x speedup from 1 threads)
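  • Q1 What are the pros and cons of the three methods? Give an assumption about their performances. Method 1: malloc + cudaMalloc Pros: Each thread handles only one pixel, which makes good use of the GPU's parallelism. It does not use page-locked memory, so it has little impact on system performance. Cons: The amount of computation differs from thread to thread, so threads that finish early have to wait for the slower ones.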
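  • Q1 Explain your implementation. How do you optimize the performance of convolution? ::: info In hostFE.c and kernel.c, every *2 and /2 is done with a bit operation for speed. In hostFE.c, the filter size is reduced and inputImage is converted from float to char to cut the amount of data transferred. In kernel.c, the convolution is rewritten to remove redundant conditionals. ::: Modifying the filter size: I removed all the redundant zeros around the outside of the filter to produce a smaller filter, so less data is sent to the GPU and some unnecessary computation is skipped during the convolution.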
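The filter-trimming idea in the last note can be sketched in plain C. This is only an illustration, not the code from hostFE.c: it assumes a square, odd-width filter stored row-major as float, and the names trim_filter and count_zero_rings are hypothetical. It strips complete all-zero outer rings and copies the remaining core into a smaller buffer.

```c
#include <assert.h>

/* Count how many complete outer rings of a square width x width
 * filter contain only zeros. The centre element is never counted. */
static int count_zero_rings(const float *filter, int width)
{
    int rings = 0;
    while (rings < width / 2) {
        int lo = rings;
        int hi = width - 1 - rings;
        for (int i = lo; i <= hi; i++) {
            /* Check the top, bottom, left and right edges of this ring. */
            if (filter[lo * width + i] != 0.0f ||
                filter[hi * width + i] != 0.0f ||
                filter[i * width + lo] != 0.0f ||
                filter[i * width + hi] != 0.0f)
                return rings;
        }
        rings++;
    }
    return rings;
}

/* Copy the non-zero core of the filter into out (which must hold at
 * least width * width floats) and return the new, smaller width.
 * Hypothetical helper; assumes an odd filter width. */
int trim_filter(const float *filter, int width, float *out)
{
    assert(width > 0 && width % 2 == 1);
    int rings = count_zero_rings(filter, width);
    /* rings * 2 written as a shift, in the spirit of the note's bit operations. */
    int new_width = width - (rings << 1);
    for (int r = 0; r < new_width; r++)
        for (int c = 0; c < new_width; c++)
            out[r * new_width + c] = filter[(r + rings) * width + (c + rings)];
    return new_width;
}
```

For a 5x5 filter whose outer ring is all zeros, trim_filter returns 3 and out holds the inner 3x3 block, so only 9 values rather than 25 would need to be copied to the device.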