# Courses: C / C++ / Parallel Programming / GPU Programming / Other Tutorials, Daily

To keep myself learning or reviewing something every day, the following YouTube channels provide tutorials/courses that I can consume daily. Notes are separated by channel and then by series, depending on the YouTuber's original playlists or tags. All the notes will be in the following format.

* Under a YouTube channel
    * Under a series title
        * Note body

> Reference:
> https://www.youtube.com/@NotesByNick
> https://www.youtube.com/@cppweekly
> https://www.youtube.com/@JacobSorber
> https://www.youtube.com/@LearnLinuxTV
> https://www.youtube.com/@MikeShah
> https://www.youtube.com/@GPUMODE
> https://www.youtube.com/@Tsoding

# Jacob Sorber: Programming with Threads

> Reference:
> https://youtube.com/playlist?list=PL9IEJIKnBJjFZxuqyJ9JqVYmuFZHr7CFM

*Notice that in this case we are using C, not C++, so I am basically learning POSIX threads.*

## How to create and join threads in C (pthreads)

**Windows and Linux have completely different ways of dealing with threads. Most of us don't do Windows programming, so POSIX threads are the default here.**

Reference:
* Introductory tutorial and examples for multithreaded parallel programming in C with `pthread`
  https://blog.gtwang.org/programming/pthread-multithreading-programming-in-c-tutorial/

## How to pass arguments to and get results from threads (`pthread_create`, `pthread_join`)

Unlike a C++ STL thread, a pthread thread function can return a value, and the returned value is caught by the pointer passed to `pthread_join`.

## Safety and Speed Issues with Threads (pthreads, mutex, locks)

This episode is basically a review of the basics of concurrent programming.

## Debugging with Multiple Threads (gdb, pthreads)

Basically, `gdb` supports multi-thread debugging; we just need to specify the thread we want to inspect.

## How to use the volatile keyword in C?

The `volatile` keyword marks a variable so that the compiler does not optimize accesses to it away. **The compiler should not make any assumptions about its value.**

### Example Code

```cpp
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

bool done = false;

void *tfunc(void *arg) {
    sleep(1);
    done = true;
    return NULL;
}

int main() {
    pthread_t t1;
    pthread_create(&t1, NULL, tfunc, NULL);
    printf("Waiting...\n");
    while (!done) {}
    printf("Ok, Moving on.\n");
    pthread_join(t1, NULL);
}
```

The compiler might optimize `while (!done)` into `while (true)`, since it cannot see that `done` will be changed by another thread. Notice that different compilers have different behavior here. To make the code **explicitly** correct, `volatile` should be used.

### Variation

```cpp
volatile int done;
volatile int *done;          // the pointer is not volatile; the int it points to is volatile
int *volatile done;          // the opposite: the pointer is volatile, the int it points to is not
volatile int *volatile done; // both the pointer and the pointee are volatile
```

# NotesByNick: Parallel C++

> Reference:
> * YouTube Playlist
>   https://youtube.com/playlist?list=PLxNPSjHT5qvsGKsAhirvZn7W73pXhXpfv&si=GidNXjDcThucVYVC
> * Github Repo
>   https://github.com/CoffeeBeforeArch/parallel_cpp/tree/main

## Parallel C++: Workload Distribution

:::info
:information_source: **Install oneTBB**
https://uxlfoundation.github.io/oneTBB/GSG/installation.html#installation
:::

:::success
:bulb: **oneTBB vs TBB**
* What is oneAPI and how does it compare to TBB?
  https://www.reddit.com/r/cpp/comments/py1tmr/comment/hevmbt6/?utm_source=share&utm_medium=web3x&utm_name=web3xcss&utm_term=1&utm_content=share_button
:::

### TBB Work Partition Strategy

> Reference:
> * Partitioner Summary
>   https://uxlfoundation.github.io/oneTBB/main/tbb_userguide/Partitioner_Summary.html

This video basically demonstrates a load-balancing issue.
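As a minimal sketch of the idea (assuming oneTBB is installed as above; the array size and the per-element work are purely illustrative), `tbb::parallel_for` accepts an explicit partitioner that controls how the iteration range is split across threads:

```cpp
// Illustrative sketch only: distributing a loop with oneTBB and an explicit partitioner.
#include <oneapi/tbb/blocked_range.h>
#include <oneapi/tbb/parallel_for.h>
#include <oneapi/tbb/partitioner.h>
#include <cstddef>
#include <vector>

int main() {
    std::vector<float> data(1 << 20, 1.0f);

    // static_partitioner fixes the chunking up front (similar to the "by chunks"
    // scheme below); the default auto_partitioner splits the range adaptively.
    oneapi::tbb::static_partitioner part;

    oneapi::tbb::parallel_for(
        oneapi::tbb::blocked_range<std::size_t>(0, data.size()),
        [&](const oneapi::tbb::blocked_range<std::size_t>& r) {
            for (std::size_t i = r.begin(); i != r.end(); ++i)
                data[i] *= 2.0f;  // per-element work
        },
        part);
}
```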
## Parallel C++: Static Partitioning

Basically, this demonstrates two different ways to distribute a set of jobs; choosing between them requires understanding the nature of the jobs.

* **by chunks**
* **round robin**

## Parallel C++: Dynamic Partitioning

Instead of making each thread work on a predefined subset of items, we can use an `atomic<int>` to keep track of the next job index and fetch new jobs with `fetch_add(1)`.

```cpp
std::atomic<int> index = 0;

// in the thread function
for (int i = index.fetch_add(1); i < num_work_items; i = index.fetch_add(1)) {
    ...
}
```

:::success
Recall that `fetch_add()` returns the current value and then increments it by 1.
:::

## Parallel C++: False Sharing

TODO: Add links to my other posts

Check the cache miss rate by using `perf stat` and `perf c2c`. (**c2c** means **cache to cache**)

## Parallel C++: Double Buffering

**Problem**: A producer–consumer relationship with a buffer. One thread generates data into the buffer and the other one consumes it. By using 2 buffers, the generating thread can keep producing the next set of data while the consumer thread is using the current data.

### Implementation Details

* Use `std::swap()` on the buffers, so both threads can always work on the same variables. The generator produces data in `buffer_1`, and after the consumer finishes working on `buffer_2`, `std::swap` exchanges `buffer_1` and `buffer_2`. Since this is done on pointers rather than with a deep copy, the consumer's waiting time is small.
* Use `std::binary_semaphore` to signal the threads' states.

:::info
:information_source: **C++20 semaphores for multithreading**
> Reference:
> https://viml.nchc.org.tw/cpp-multi-thread-semaphore/

* `std::binary_semaphore` can be used as a mutex if only one thread at a time is allowed to work.
* It can somewhat replace `condition_variable` as a signal between threads, which is how it is used in our case.
:::

## Parallel C++: SpinLocks

* A spinlock is **busy waiting** and is suitable for very short waiting times.
* C++ does not provide a spinlock in the STL, but we can make one using CAS with `std::atomic<bool>`, as in the class below (a usage sketch follows at the end of this section).

```cpp
#include <atomic>

/**
 * Spinlock implementation
 * @ref https://zhuanlan.zhihu.com/p/143804937
 */
class SpinLock {
public:
    SpinLock() : flag_(false) {}

    void lock() {
        bool expected = false;
        // spin until we swap false -> true; reset `expected` after each failed CAS
        while (!flag_.compare_exchange_weak(expected, true)) {
            expected = false;
        }
    }

    void unlock() { flag_.store(false); }

private:
    std::atomic<bool> flag_;
};
```

* `pthread` provides a spinlock implementation with `pthread_spinlock_t` and the functions that operate on it.
* Since the waiting thread is never put to sleep, the performance on simple things, such as waiting for an element to be popped from a `std::list`, will be better than with `std::mutex`.
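A quick usage sketch for the `SpinLock` class above (the thread count and iteration count are arbitrary); each thread holds the lock only for a single increment:

```cpp
// Usage sketch for the SpinLock class defined above (illustrative only).
#include <cstdio>
#include <thread>
#include <vector>

SpinLock spin;     // assumes the SpinLock class from the snippet above is in scope
long counter = 0;  // shared state protected by the spinlock

int main() {
    std::vector<std::thread> threads;
    for (int t = 0; t < 4; t++) {
        threads.emplace_back([] {
            for (int i = 0; i < 100000; i++) {
                spin.lock();    // busy-wait until the flag is acquired
                counter++;      // critical section
                spin.unlock();
            }
        });
    }
    for (auto& th : threads) th.join();
    std::printf("counter = %ld\n", counter);  // expect 400000
}
```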
## Parallel C++: Fairness

There is no fairness built into a spinlock: all the waiting threads rush to grab the lock.

### Ticket Spinlock

> Github Source Code
> https://github.com/CoffeeBeforeArch/parallel_cpp/blob/main/006_fairness/1_ticket.cpp

```cpp
class Spinlock {
private:
    // Lock is now two counters:
    // 1.) The latest place taken in line
    // 2.) Which number is currently being served
    std::atomic<std::uint16_t> line{0};
    volatile std::uint16_t serving{0};

public:
    void lock() {
        // Get the latest place in line (and increment the value)
        auto place = line.fetch_add(1);

        // Wait until our number is "called"
        while (serving != place) _mm_pause();
    }

    // Unlocking mechanism
    // Increment serving number to pass the lock
    // No need for an atomic! The thread with the lock is the only one that
    // accesses this variable!
    void unlock() {
        asm volatile("" : : : "memory"); // compiler memory barrier (prevents reordering)
        serving = serving + 1;
    }
};
```

* Notice that `auto place = line.fetch_add(1);` basically makes every thread grab a ticket with a waiting number.
* The number grabbed by each thread only depends on the order in which the atomic operations execute. The numbers will not necessarily (and most of the time will not) match the order in which the threads were invoked.

:::info
:information_source: **`_mm_pause` usage in `gcc` on Intel (CPU)**
This requires SSE2, which is supported by most modern CPUs nowadays.
https://stackoverflow.com/q/37063700
:::

:::info
:information_source: **Memory Barrier**
https://zh.wikipedia.org/zh-tw/%E5%86%85%E5%AD%98%E5%B1%8F%E9%9A%9C
:::

:::info
:information_source: **Linux spinlock scalability, from the perspective of CPU cache coherence**
https://hackmd.io/@sysprog/linux-spinlock-scalability
TODO: Notes
:::

## Parallel C++: Concurrent Containers

[TBB](https://github.com/uxlfoundation/oneTBB) provides concurrent data structures.

## Parallel C++: Blocking and Non-Blocking Algorithms

Using `std::atomic` for synchronization yields a non-blocking algorithm; the same applies to non-blocking data structures.

```cpp
while (true) {
    int desired;
    int expected = sink.load();
    do {
        if (expected == iterations) return;

        // update the current desired value
        desired = expected + 1;

        // try CAS until success
    } while (!sink.compare_exchange_strong(expected, desired));
}
```

The above code snippet shows how a CAS can be used. `sink.load()` is needed before we start the `do while` loop so that the CAS has a chance to succeed. If `sink` is no longer equal to `expected` (the previously loaded value), it means `sink` was changed by another thread in the meantime; the failed CAS reloads `sink` into `expected`, and we proceed to the next iteration of the `do while` loop.

## Parallel C++: Lock-Free and Wait-Free Algorithms

:::info
:information_source: **Concurrent programming: Lock-Free Programming**
https://hackmd.io/@sysprog/concurrency-lockfree
This article describes the difference between lock-free and lock-less designs.
:::

* **Non-blocking**: If one thread is paused or put to sleep, all the other threads will still work without possibly being blocked.
* **Lock-free**: At any given point in time, one of the threads will succeed with its CAS, as in the previous section. At least one thread makes progress.
* **Wait-free**: All threads make progress.

:::success
:bulb: **Non-blocking algorithms in concurrency**
https://lrita.github.io/2019/02/05/non-block-algorithm-with-concurrence/
:::

## Parallel C++: Vectorization

:::success
:information_source: **Transform Functions**
* `transform_reduce()`
  https://www.youtube.com/watch?v=9o3f92X_f_A
  The default behavior multiplies the two vectors element by element as the transform part and sums the results as the reduce part. Therefore, it's basically a dot product.
* `reduce()`
  https://www.geeksforgeeks.org/cpp/reduce-in-cpp-stl/
* `transform()`
  https://www.geeksforgeeks.org/cpp/transform-c-stl-perform-operation-elements/
:::

The above functions can be vectorized given the correct execution policy and compiler flags.
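As a small sketch of this (the vector contents are made up, and the exact flags depend on the toolchain; for example, libstdc++'s parallel policies are backed by TBB, and vectorization generally wants at least `-O2`), the default `transform_reduce` over two ranges is a dot product:

```cpp
// Sketch: transform_reduce as a dot product with a parallel + vectorized execution policy.
#include <cstdio>
#include <execution>
#include <numeric>
#include <vector>

int main() {
    std::vector<float> a(1024, 2.0f);
    std::vector<float> b(1024, 3.0f);

    // transform part: a[i] * b[i]; reduce part: sum of the products
    float dot = std::transform_reduce(std::execution::par_unseq,
                                      a.begin(), a.end(), b.begin(), 0.0f);

    std::printf("dot = %.1f\n", dot);  // 1024 * (2 * 3) = 6144.0
}
```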
## Parallel C++: SIMD Intrinsics

**My Other Works**
* https://hackmd.io/@Erebustsai/HkdXPx-rh
* https://hackmd.io/@Erebustsai/B1ZGe5gsT

## Parallel C++: Unsafe Math Optimizations

:::info
https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html#index-funsafe-math-optimizations
:::

Note that `-funsafe-math-optimizations` is not turned on by any `-O` level besides `-Ofast`, since it does not obey the IEEE or ANSI rules and might generate different results. However, the result is not necessarily worse, just different. For ML-based applications and computations, this floating-point precision difference might not affect the final result.

:::success
* What Every Computer Scientist Should Know About Floating-Point Arithmetic
  https://docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg.html
:::

## Parallel C++: Thread Affinity

> Reference:
> * C/C++ Linux/Unix: pinning a thread to a specific CPU with `pthread_setaffinity_np`
>   https://shengyu7697.github.io/cpp-pthread_setaffinity_np/

:::info
:information_source: **`C11 std thread` with `pthread` Thread Affinity**
* https://eli.thegreenplace.net/2016/c11-threads-affinity-and-hyperthreading/
* https://stackoverflow.com/a/57620568
* https://stackoverflow.com/a/35278467
* https://stackoverflow.com/a/77716318

Apparently, there are ways to get access to the `std::thread` implementation on the platform (mostly Linux with pthread) and set affinity with native library calls (like pthread).
:::

If a set of threads share (work on) the same data, we should set their thread affinity to the same CPU core. Intuitively, the cache line will then not bounce around between CPU cores but stay at one core, reducing cache misses.

:::success
:bulb: **How to check cache behavior?**
* `perf stat -d ./out`
* `perf c2c record ./out` then `perf c2c report` for more detail
:::

## Parallel C++: Memory Ordering in x86 Processors

### My Other Posts
* https://hackmd.io/d8us7WQSQkCiY_3g4l0GyQ#C-Memory-Model
* https://hackmd.io/dku91ON4T5uVag0c3TsUtA#Memory-Model

# Mike Shah: Modern Cpp Series (Subtopic: Safety)

## Modern Cpp Series Ep. 232: C++ Zero-Overhead Principle

> Reference: https://www.youtube.com/watch?v=wSVCnaLbQ2k

:::info
:information_source: **Zero-Overhead Principle**
https://en.cppreference.com/w/cpp/language/Zero-overhead_principle.html
:::

* You don't pay for what you don't use.
* What you do use is just as efficient as what you could reasonably write by hand.

```cpp
std::vector<int> v {0, 1, 2, 3, 4, 5, 6};

v[9] = 8;                        // out-of-bounds write: no check, undefined behavior
std::cout << v[7] << std::endl;  // might "work" if the capacity happens to be big enough
int temp = v.at(7);              // .at() enforces bounds checking and throws std::out_of_range
```

# GPU Mode: Lecture 64: Multi-GPU programming

> Reference:
> https://www.youtube.com/watch?v=BgeFR4UfajQ

## Prerequisite: Jacobi Solver

:::info
:information_source: **Jacobi Method**
> Reference:
> https://www.youtube.com/watch?v=bR2SEe8W3Ig

Basically, this method tries to solve a system of equations iteratively: starting from an assumed initial guess, it repeatedly substitutes the current values of the variables until the answer converges.
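For reference, the standard (textbook) Jacobi update for a system $A\mathbf{x} = \mathbf{b}$ is:

$$
x_i^{(k+1)} = \frac{1}{a_{ii}}\Big(b_i - \sum_{j \neq i} a_{ij}\, x_j^{(k)}\Big)
$$

For the discretized 2-D Laplace problem used below, this update reduces to replacing each interior cell with the average of its four neighbours, which is exactly what the loop after this box computes.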
:::

```cpp
for (int iy = 1; iy < ny - 1; iy++)
    for (int ix = 1; ix < nx - 1; ix++)
        a_new[iy * nx + ix] = 0.25 * (a[iy * nx + (ix + 1)] + a[iy * nx + ix - 1] +
                                      a[(iy - 1) * nx + ix] + a[(iy + 1) * nx + ix]);
```

## Halo Exchange with Ghost Cells (MPI Patterns)

![image](https://hackmd.io/_uploads/rk7_1OATle.png)

```cpp
for (int iy = iy_start; iy < iy_end; iy++)
    for (int ix = 1; ix < nx - 1; ix++)
        a_new[iy * nx + ix] = 0.25 * (a[iy * nx + (ix + 1)] + a[iy * nx + ix - 1] +
                                      a[(iy - 1) * nx + ix] + a[(iy + 1) * nx + ix]);
```

```cpp
MPI_Sendrecv(a_new + iy_start * nx, nx, MPI_FLOAT, top, 0,
             a_new + (iy_end * nx), nx, MPI_FLOAT, bottom, 0,
             MPI_COMM_WORLD, MPI_STATUS_IGNORE);
MPI_Sendrecv(a_new + (iy_end - 1) * nx, nx, MPI_FLOAT, bottom, 0,
             a_new, nx, MPI_FLOAT, top, 0,
             MPI_COMM_WORLD, MPI_STATUS_IGNORE);
```

> Reference:
> https://hackmd.io/@Erebustsai/SkyCU2g4n

*The above reference is another post of mine, in which ghost cells and halo exchange are explained.*

Notice that, with CUDA-aware MPI, the unified virtual addressing between CPU and GPU makes it easier to use MPI buffers that live on the GPU. We don't need to look into the details of how it is done; we just need to make sure that the pointer points to the right memory address belonging to the right device, which can be checked with `cudaPointerGetAttributes()`.

:::success
:bulb: Make sure to **launch 1 process per GPU**. For example, `$ [srun|mpirun|nvshmrun] [-np|-n] 4 ./myapp <args>`.
:::

## CUDA-aware MPI: What's Under the Hood

If we are using CUDA-aware MPI, we don't need to change the code; MPI will automatically use the best transport available. For example, if the processes are on the same node and NVLink is available, we get the fastest transfer speed since we don't need to go through PCIe. Without CUDA-aware MPI, the code needs to explicitly copy the data from GPU memory to CPU memory before sending, and do the reverse on the other side.

## Overlapping Communication with Computation

```cpp
launch_jacobi_kernel(a_new, a, l2_norm_d, iy_start, (iy_start + 1), nx, push_top_stream);
launch_jacobi_kernel(a_new, a, l2_norm_d, (iy_end - 1), iy_end, nx, push_bottom_stream);
launch_jacobi_kernel(a_new, a, l2_norm_d, (iy_start + 1), (iy_end - 1), nx, compute_stream);
...
```

Instead of using one stream and waiting for communication, we can launch (in this case) 3 streams to overlap communication with computation. In the code snippet, we launch one kernel for the first row, which requires communication with another process; one kernel for the last row, which requires communication as well; and one kernel for the rest of the rows in between. Most importantly, these kernels are launched on different CUDA streams.

## Using NCCL

TODO: STOP HERE, going further for another time.
# NVIDIA: Accelerating Applications with Parallel Algorithms | CUDA C++ Class Part 1

> Reference:
> * Youtube Playlist
>   https://youtube.com/playlist?list=PL5B692fm6--vWLhYPqLcEu6RF3hXjEyJr&si=5HoK1W6SSBGkjfaV
> * Github Repository
>   https://github.com/NVIDIA/accelerated-computing-hub/tree/main/tutorials/cuda-cpp

## Using the Thrust Library

![image](https://hackmd.io/_uploads/BJr_A-Zxbx.png)
> Reference: https://github.com/NVIDIA/accelerated-computing-hub/blob/main/tutorials/cuda-cpp/slides/cuda_cpp_tutorial__r4.pdf

### Porting Algorithms to GPU

```cpp
#include <thrust/execution_policy.h>
#include <thrust/universal_vector.h>
#include <thrust/transform.h>
#include <cstdio>

int main(int argc, char** argv) {
    float k = 0.5;
    float ambient_temp = 20;
    thrust::universal_vector<float> temp { 42.0f, 24.0f, 50.0f };

    auto op = [=] __host__ __device__ (float temp) {
        return temp + k * (ambient_temp - temp);
    };

    for (int step = 0; step < 3; step++) {
        thrust::transform(thrust::device, temp.begin(), temp.end(), temp.begin(), op);
        std::printf("%d %.2f %.2f %.2f\n", step, temp[0], temp[1], temp[2]);
    }
}
```

* `thrust::universal_vector`: refer to [this](https://developer.nvidia.com/blog/support-for-cuda-unified-memory-now-available-in-thrust/)
* Execution Policy: Thrust functions take an execution policy parameter, which indicates where the code runs, e.g. `thrust::device`, `thrust::host`.

## Fancy Iterators

*Problem: compute the max difference between two vectors.*

### Naively

Using a `thrust::transform` and a `thrust::reduce` can produce the correct result. However, the performance can be improved by reducing the 3N reads and 1N write. As the second picture shows, we can reduce it to just 2N reads.

![image](https://hackmd.io/_uploads/HJWOo7-l-x.png)
![image](https://hackmd.io/_uploads/SyI_2mblWl.png)

```cpp
float naive_max_change(const thrust::universal_vector<float>& a,
                       const thrust::universal_vector<float>& b) {
    // allocate vector to store `a - b`
    thrust::universal_vector<float> unnecessarily_materialized_diff(a.size());

    // compute element-wise absolute differences
    thrust::transform(thrust::device,
                      a.begin(), a.end(),                          // first input sequence
                      b.begin(),                                   // second input sequence
                      unnecessarily_materialized_diff.begin(),     // result
                      [] __host__ __device__ (float x, float y) {  // transformation (abs diff)
                          return abs(x - y);
                      });

    // compute max difference
    return thrust::reduce(thrust::device,
                          unnecessarily_materialized_diff.begin(),
                          unnecessarily_materialized_diff.end(),
                          0.0f, thrust::maximum<float>{});
}
```

:::success
:bulb: **Transform Iterator & Zip Iterator**

The subscript `operator[]` can be overloaded with different implementations to get different behaviors.

* Simple Counting Iterator
```cpp
struct counting_iterator {
    int operator[](int i) { return i; }
};
```
* Simple Transform Iterator
```cpp
struct transform_iterator {
    int *a;
    int operator[](int i) { return a[i] * 2; }
};
```
* Simple Zip Iterator
```cpp
struct zip_iterator {
    int *a;
    int *b;
    std::tuple<int, int> operator[](int i) { return { a[i], b[i] }; }
};
```
* Combining Input Iterators
```cpp
struct transform_iterator {
    zip_iterator zip;
    int operator[](int i) {
        auto [a, b] = zip[i];
        return abs(a - b);
    }
};

std::array<int, 3> a { 0, 1, 2 };
std::array<int, 3> b { 5, 4, 2 };

zip_iterator zip { a.data(), b.data() };
transform_iterator it { zip };

std::printf("it[0] : %d\n", it[0]);
```
:::

### Thrust Fancy Iterators

Thrust provides implementations of the fancy iterators mentioned in the previous section.
```cpp
thrust::make_transform_iterator(
    a.begin(),
    [] __host__ __device__ (int a) { return a * 2; }
);

thrust::make_zip_iterator(a.begin(), b.begin());

thrust::make_counting_iterator(1);
```

### With Iterators

```cpp
float max_change(const thrust::universal_vector<float>& a,
                 const thrust::universal_vector<float>& b) {
    auto zip = thrust::make_zip_iterator(a.begin(), b.begin());
    auto transform = thrust::make_transform_iterator(zip,
        [] __host__ __device__ (thrust::tuple<float, float> t) {
            return abs(thrust::get<0>(t) - thrust::get<1>(t));
        });

    // compute max difference
    return thrust::reduce(thrust::device, transform, transform + a.size(),
                          0.0f, thrust::maximum<float>{});
}
```

By using fancy iterators, we can avoid creating arrays to store intermediate results that would then be used to produce the final result. Note that creating arrays, or reading and copying them entirely, is very costly when doing GPU programming.

## Implementing Stencil Pattern

```cpp
const float *in_ptr = thrust::raw_pointer_cast(in.data());
auto cell_indices = thrust::make_counting_iterator(0);

thrust::transform(
    thrust::device, cell_indices, cell_indices + in.size(), out.begin(),
    [in_ptr, height, width] __host__ __device__ (int id) {
        int column = id % width;
        int row    = id / width;

        if (row > 0 && column > 0 && row < height - 1 && column < width - 1) {
            float d2tdx2 = in_ptr[row * width + column - 1] -
                           in_ptr[row * width + column] * 2 +
                           in_ptr[row * width + column + 1];
            float d2tdy2 = in_ptr[(row - 1) * width + column] -
                           in_ptr[row * width + column] * 2 +
                           in_ptr[(row + 1) * width + column];
            return in_ptr[row * width + column] + 0.2f * (d2tdx2 + d2tdy2);
        } else {
            return in_ptr[row * width + column]; // boundary cells are left unchanged
        }
    });
```

## Using `thrust::tabulate`

`thrust::tabulate(first, last, op)`
* `first` and `last` describe the output range.
* This is essentially equivalent to a transform over a counting iterator, so there is no need for input iterators.
* When there's a specialized algorithm, prefer it, since it'll likely perform better.

```cpp
const float *in_ptr = thrust::raw_pointer_cast(in.data());

thrust::tabulate(
    thrust::device, out.begin(), out.end(),
    [in_ptr, height, width] __host__ __device__ (int id) {
        int column = id % width;
        int row    = id / width;

        if (row > 0 && column > 0 && row < height - 1 && column < width - 1) {
            float d2tdx2 = in_ptr[row * width + column - 1] -
                           in_ptr[row * width + column] * 2 +
                           in_ptr[row * width + column + 1];
            float d2tdy2 = in_ptr[(row - 1) * width + column] -
                           in_ptr[row * width + column] * 2 +
                           in_ptr[(row + 1) * width + column];
            return in_ptr[row * width + column] + 0.2f * (d2tdx2 + d2tdy2);
        } else {
            return in_ptr[row * width + column]; // boundary cells are left unchanged
        }
    });
```

## Vocabulary Types in `libcu++`

* Compound Types: `cuda::std::pair`, `cuda::std::tuple`
* Optional and Alternatives: `cuda::std::optional`
* Math: `cuda::std::complex`
* Synchronization: `cuda::std::atomic`
* CUDA Extensions: `cuda::atomic`

:::success
:bulb: **Don't Repeat Yourself (DRY)**

Create a helper function that computes `row` and `column` from the flat index.
```cpp
__host__ __device__
cuda::std::pair<int, int> row_col(int id, int width) {
    return cuda::std::make_pair(id / width, id % width);
}
```
:::

### Using `mdspan`

:::info
:information_source: **`std::mdspan`**
* C++23 `mdspan`: a container view that accesses contiguous data as a multi-dimensional array
  https://viml.nchc.org.tw/std-mdspan-in-cpp23/
:::

```cpp
cuda::std::array<int, 6> sd { 0, 1, 2, 3, 4, 5 };
cuda::std::mdspan md(sd.data(), 2, 3);     // (row size, column size)

std::printf("md(0, 0) = %d\n", md(0, 0));  // 0 (row index, column index)
std::printf("md(1, 2) = %d\n", md(1, 2));  // 5
```

## Serial vs Parallel

![image](https://hackmd.io/_uploads/SJzfvyMxWg.png)

### With `tabulate`

```cpp
thrust::tabulate(
    thrust::device, sums.begin(), sums.end(),
    [=] __host__ __device__ (int row_id) {
        float sum = 0;
        for (int col = 0; col < width; col++) {
            sum += temp(row_id, col);  // temp is an mdspan over the temperature grid
        }
        return sum;
    }
);
```

![image](https://hackmd.io/_uploads/HkJVd1fl-e.png)

As the above image shows, `thrust::tabulate` and `thrust::transform` work on the iterators and map one thread to one element.

### Using `thrust::reduce_by_key`

![image](https://hackmd.io/_uploads/HkGBKyGl-g.png)

All values we want to reduce are in the same row, so we can reduce values in groups of consecutive equal keys. More than one thread will be mapped to one output element (a row).

```cpp
thrust::universal_vector<float> row_ids(height * width);
thrust::tabulate(
    row_ids.begin(), row_ids.end(),
    [=] __host__ __device__ (int i) { return i / width; }
); // generate keys

thrust::universal_vector<float> sums(height);
thrust::reduce_by_key(
    thrust::device,
    row_ids.begin(), row_ids.end(),
    temp.begin(),
    thrust::make_discard_iterator(),
    sums.begin()
);
```

# NVIDIA: Asynchrony and CUDA Streams | CUDA C++ Class Part 2

> Reference:
> https://www.youtube.com/watch?v=pyW9St8uM8w

:::warning
*Part 3, which is the final class, is not included.*
:::

## Asynchronous API

* `Thrust` is implemented via another CUDA core library, `CUB`.
* `CUB` is asynchronous, which means non-blocking API calls.

```cpp
// thrust
thrust::tabulate(thrust::device, out.begin(), out.end(), compute);

// cub
auto cell_ids = thrust::make_counting_iterator(0);
cub::DeviceTransform::Transform(cell_ids, out.begin(), num_cells, compute);
```

## Pinned Memory

```cpp
thrust::universal_host_pinned_vector<float> pinned_memory(size);
```

# Others

*Some other tutorials that do not belong to any series.*

## Casting in C++

*Apparently, I never fully understood casting in C++, so this section was created.*

> Reference:
> * [C++] Standard type casts: `static_cast`, `dynamic_cast`, `reinterpret_cast`, and `const_cast`
>   https://b8807053.pixnet.net/blog/post/339782825
> * C++'s four cast operators
>   http://www.uusystem.com/C++/C++%E4%B8%AD%E5%9F%BA%E6%9C%AC%E6%95%B0%E6%8D%AE%E7%B1%BB%E5%9E%8B/C++%E5%9B%9B%E7%A7%8D%E5%BC%BA%E5%88%B6%E7%B1%BB%E5%9E%8B%E8%BD%AC%E6%8D%A2/C++%E5%9B%9B%E7%A7%8D%E5%BC%BA%E5%88%B6%E7%B1%BB%E5%9E%8B%E8%BD%AC%E6%8D%A2.html

* `static_cast`: Basically, `static_cast` is just the replacement for traditional C-style casts such as `(double)a`.
* `dynamic_cast`: This is used when a base-class pointer needs to be cast to a derived-type pointer, since a base pointer can point to any derived object. The cast returns 0 (a null pointer) if it fails. For example, with `derive2 d2; base* based = &d2;`, the cast `dynamic_cast<derive1*>(based)` returns 0; see the sketch after this list.
* `reinterpret_cast`: Casts a pointer type to a different pointer type. This does not check anything.
* `const_cast`: This casts **out** const.
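A runnable sketch of the `dynamic_cast` bullet (class names follow the example above; note that `base` must be polymorphic, i.e. have a virtual function, for `dynamic_cast` to work):

```cpp
#include <cstdio>

struct base { virtual ~base() = default; };  // dynamic_cast needs a polymorphic base
struct derive1 : base {};
struct derive2 : base {};

int main() {
    derive2 d2;
    base *based = &d2;

    derive1 *p1 = dynamic_cast<derive1 *>(based);  // wrong derived type -> returns nullptr (0)
    derive2 *p2 = dynamic_cast<derive2 *>(based);  // actual dynamic type -> valid pointer

    std::printf("p1 = %p, p2 = %p\n", (void *)p1, (void *)p2);
}
```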
And for `const_cast`:

```cpp
const int g1 = 20;
int *h1 = const_cast<int *>(&g1);   // cast away const from a pointer to a const variable

const int g2 = 20;
int &h2 = const_cast<int &>(g2);    // cast away const through a reference

const char *g3 = "hello";
char *h3 = const_cast<char *>(g3);  // cast away const from a pointer to const
```

## VisCircuit Circuit Notes: How were the ESP32, Arduino, STM32 and 8051 invented?

> Reference:
> https://www.youtube.com/watch?v=t0zRpOQHeF4

## The Chaotic State of GPU Programming

> Reference:
> https://www.youtube.com/watch?v=9-DiGrnz8l8