蔡家倫

@SoulRRRRR

Joined on Jun 1, 2020

  • Quiz $\alpha$-1: This program is an implementation of an S-tree, a self-balancing BST that uses a hint (similar to an AVL tree's height) to estimate subtree height for balancing. After inserting a node (st_insert), the S-tree runs st_update on that node to rebalance it (left or right rotation) and refresh its hint value (max(left hint, right hint) + 1). As long as the node is not the root and its hint value is the same as before balancing, the update continues on the node's parent. Removing a node (st_remove) behaves differently depending on the node: if it has a right child, swap its value with the first node (successor) of the right subtree, remove that node, then update the successor's right child; if it has no right child but has a left child, swap its value with the last node (predecessor) of the left subtree, remove that node, then update the predecessor's left child; if it has neither, remove it directly and update its parent.
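The hint maintenance described above can be sketched in C. This is a simplified illustration, not the quiz's actual code: the struct layout and helper names are assumptions, and the rotations plus the early-exit condition are omitted, so this st_update simply recomputes hints up to the root.

```c
#include <stddef.h>

/* Hypothetical node layout; the field names (hint, left, right,
 * parent) follow the description above, not the quiz source. */
struct st_node {
    int hint;
    struct st_node *left, *right, *parent;
};

/* A NULL subtree contributes hint -1, so a leaf gets hint 0. */
static int st_hint(const struct st_node *n)
{
    return n ? n->hint : -1;
}

/* Recompute a node's hint as max(left hint, right hint) + 1 and
 * walk upward toward the root. Rotations and the "stop when the
 * hint is unchanged" condition are left out of this sketch. */
static void st_update(struct st_node *n)
{
    while (n) {
        int l = st_hint(n->left), r = st_hint(n->right);
        n->hint = (l > r ? l : r) + 1;
        n = n->parent;
    }
}
```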
  • Q1 (5 points): Explain your implementation. How do you optimize the performance of convolution? For hostFE.c, I changed the global and local work sizes of clEnqueueNDRangeKernel to {(imageWidth+15)/16*16, (imageHeight+15)/16*16} and {16, 16}. For kernel.cl, I copied the serial version and changed the indexing. I also added const and __restrict__ to help the compiler generate faster code.
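The {(imageWidth+15)/16*16, …} expressions round each dimension up to a multiple of the 16×16 local work size, since OpenCL requires the global size to be divisible by the local size. A minimal sketch of that padding arithmetic (round_up is a hypothetical helper name, not from the assignment code):

```c
#include <stddef.h>

/* Round x up to the next multiple of the work-group dimension,
 * e.g. round_up(imageWidth, 16) == (imageWidth + 15) / 16 * 16. */
static size_t round_up(size_t x, size_t group)
{
    return (x + group - 1) / group * group;
}
```

The extra work-items past the image edge do no useful work, so the kernel must guard its body with a bounds check against the real image dimensions.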
  • Q1: What are the pros and cons of the three methods? Give an assumption about their performances. From the CUDA documentation: "Copies between page-locked host memory and device memory can be performed concurrently with kernel execution for some devices as mentioned in Asynchronous Concurrent Execution." Memory allocated with cudaHostAlloc is faster to transfer with cudaMemcpy, so I expect method 2 to beat method 1. For method 3, if the block size and the number of pixels processed at a time are not tuned well, many threads may waste time idling. I expect speed: 2 > 1 > 3. Q2: How are the performances of the three methods? Plot a chart to show the differences among the three methods.
  • Q1 (5 points): How do you control the number of MPI processes on each node? (3 points) Which functions do you use for retrieving the rank of an MPI process and the total number of processes? (2 points) I control the number of processes with -np <number>. I use MPI_Comm_rank to get the rank of an MPI process and MPI_Comm_size to get the total number of processes. Q2 (3 points): Why are MPI_Send and MPI_Recv called "blocking" communication? (2 points) Measure the performance (execution time) of the code for 2, 4, 8, 12, 16 MPI processes and plot it. (1 point)
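For the per-node part of Q1, -np only sets the total process count; one common way to cap the processes placed on each node (assuming Open MPI; the hostnames and program name below are hypothetical) is a hostfile with slots entries:

```shell
# Hypothetical hostfile: "slots" caps processes per node (Open MPI syntax).
$ cat hosts
node1 slots=4
node2 slots=4

# 8 processes total, at most 4 on each node:
$ mpirun -np 8 --hostfile hosts ./pi_estimate
```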
  • Q1: In your write-up, produce a graph of speedup compared to the reference sequential implementation as a function of the number of threads used FOR VIEW 1. Is speedup linear in the number of threads used? In your write-up hypothesize why this is (or is not) the case? (You may also wish to produce a graph for VIEW 2 to help you come up with a good answer.) View 1 View 2 The speedup does not grow linearly with the number of threads. I think this is because the middle rows of VIEW 1 are more complex: with spatial partitioning, thread 1 has to do more work than threads 0 and 2, so the overall speedup is worse than with only 2 threads. The complexity of VIEW 2 is more evenly distributed, which is why its speedup grows linearly.
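A common remedy for the imbalance described above is interleaved row assignment: thread t takes rows t, t + n, t + 2n, …, so the expensive middle rows are spread across all threads instead of landing on one. A sketch under that assumption (row_cost and the function name are illustrative, not from the assignment code):

```c
#include <stddef.h>

/* Total work for thread `tid` under interleaved row assignment:
 * thread t handles rows t, t + nthreads, t + 2*nthreads, ... so
 * expensive contiguous regions are shared by all threads. */
static long interleaved_cost(const long *row_cost, size_t height,
                             size_t tid, size_t nthreads)
{
    long total = 0;
    for (size_t r = tid; r < height; r += nthreads)
        total += row_cost[r];
    return total;
}
```

With a middle-heavy cost profile like a Mandelbrot view, each thread ends up with a near-equal share, whereas a contiguous block split gives the middle thread almost all the work.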
  • 0816034 蔡家倫 Q1-1: Does the vector utilization increase, decrease or stay the same as VECTOR_WIDTH changes? Why? The vector utilization decreases as VECTOR_WIDTH increases. I think this is because a wider vector groups more numbers together, and the parallel while (count > 0) loop keeps running useless iterations until the largest value in the group reaches 0. e.g. for the vector [1 2 3 4]: VECTOR_WIDTH 1: 1+2+3+4 = 10 useful iterations, utilization 10/10 = 1; VECTOR_WIDTH 2: group maxima 2+4 = 6, utilization 10/(6 * 2) ≈ 0.83; VECTOR_WIDTH 4: group maximum 4, utilization 10/(4 * 4) = 0.625.
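The arithmetic above can be checked with a small helper. This is a sketch of the utilization model only (the function name is hypothetical, not from the assignment code): each lane needs value[i] useful iterations, but every lane in a group executes until the group's largest value finishes.

```c
#include <stddef.h>

/* Utilization of a max-iteration SIMD loop: useful work is the sum
 * of all lane counts, executed work is (group maximum) * width per
 * group, matching the worked example for [1 2 3 4]. */
static double vector_utilization(const int *v, size_t n, size_t width)
{
    long useful = 0, executed = 0;
    for (size_t i = 0; i < n; i += width) {
        int group_max = 0;
        for (size_t j = i; j < i + width && j < n; j++) {
            useful += v[j];
            if (v[j] > group_max)
                group_max = v[j];
        }
        executed += (long)group_max * (long)width;
    }
    return (double)useful / (double)executed;
}
```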
  • Score: 95
    ------------------------------------
    Your ID: 84 Your IP: 10.113.0.84
    ---------------[Log]: Ping----------------
    PING 10.113.0.84 (10.113.0.84): 56 data bytes
    64 bytes from 10.113.0.84: seq=0 ttl=63 time=50.142 ms
    --- 10.113.0.84 ping statistics ---
    1 packets transmitted, 1 packets received, 0% packet loss
  • https://open.kattis.com/problems/licenserenewal
    #include <bits/stdc++.h>
    using namespace std;
    #pragma GCC optimize("O2", "O3", "Ofast", "unroll-loops")
    #define fs first
    #define sc second
    #define pb push_back