[TOC]
## Best practice for the structure of a C project
- Create a root directory for your project.
- src: Store your source files (.c files) here.
- include: Place your header files (.h files) here.
- lib (optional): Store external libraries or dependencies.
- bin (optional): Output directory for compiled executables or libraries.
- test (optional): Directory for test files.
```
my_project/
├── src
│ ├── main.c
│ ├── module1.c
│ └── module2.c
├── include
│ ├── main.h
│ ├── module1.h
│ └── module2.h
├── lib
│ └── ...
├── bin
│ └── ...
└── test
└── ...
```
- Header Files and Source Files:
- One header file per source file: Generally, each .c file should have a corresponding .h file with the same name.
- Header guards or #pragma once: Prevent multiple inclusions of the same header file.
- Declarations in header files: Place function declarations (without the body), variable declarations, struct/union definitions, and macros in header files. Except:
- Inline Functions: These are small functions that can be defined directly in header files. The compiler inserts the code of inline functions at the call site during compilation, potentially improving performance. However, overuse of inline functions can increase code size. (1)
- Templated Classes and Functions: Templates are code structures that can work with different data types. They are typically defined entirely in header files because the compiler needs access to the template definition to generate code for specific data types during compilation. (2)
- Definitions in source files: Implement functions, initialize global variables, and provide the main logic in source files. (3)
- (1) Inline functions
- Regular Function Call:
```!
When a regular function is called, the program has to:
- Push the function arguments onto the stack.
- Jump to the memory location where the function code is stored.
- Execute the function code.
- Get the return value (if any) from the function.
- Pop the arguments from the stack.
- Jump back to the calling function.
--> This process of jumping back and forth and managing the stack takes some time, which is called function call overhead.
```
- Inline Function:
```!
- When an inline function is called, the compiler essentially copies and pastes the function's code directly into the calling function at the point where it is called.
- This eliminates the need for the function call overhead, as the code is executed directly within the calling function.
```
Simply speaking, when a function is inlined, the compiler creates a copy of the function's code and inserts it directly at each call site --> Best practice:
- Inline "small" + "frequently called" functions.
- Fully implement inlined functions in the header file.
- Inline is just a hint for the compiler; the decision to inline or not is up to the compiler.
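- A minimal sketch (header and function names are hypothetical) of an inline function fully implemented in a header:
```cpp!
// math_utils.h (hypothetical header)
#pragma once

// Small and frequently called: a good inline candidate.
// The full body lives in the header so every includer sees it.
inline int square(int x)
{
    return x * x;
}
```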
- (2) Templated classes and functions in header files
- Both the declaration and the definition (implementation) of the template are placed within the header file. The reason is that **the compiler needs to see the full template definition at every point where it instantiates the template for a concrete type; if the definition lives in a separate source file, other translation units only see the declaration, and the linker reports unresolved references (unless you use explicit instantiation, covered later)**.
```cpp!
// inside a header file
#pragma once
// Template function declaration and definition
template <typename T>
T max(T a, T b) {
    return (a > b) ? a : b;
}
```
- (3) Notes on ordinary (non-template) functions:
- Declarations in file.h should be implemented in file.c. We must ```#include "file.h"``` in file.c, because including the header with the function prototype lets the compiler verify that the implementation matches the declaration and ensures correct linking during the build process.
- If multiple source files define the same function (for example, two source files include file.h and each implements the same function declared there), the build fails at link time with a multiple-definition error.
- For example, if func() has its declaration in file.h and its implementation in file.c, calling it from other source files works: the linker resolves each call to the single definition in file.c.
- Pragma Once
The #pragma once directive is a non-standard but widely supported preprocessor directive in C and C++. Its purpose is to prevent a header file from being included multiple times within a single translation unit (typically a single source file).
- Multiple source files can include the same header file with #pragma once without any issues. Each source file will have its own independent inclusion of the header, and the #pragma once directive will ensure that the header's contents are only processed once within each file.
- Preventing multiple inclusions within a file: The purpose of #pragma once (or traditional header guards) is to avoid problems that can arise when a header file is included multiple times within the same source file. This can happen due to nested inclusions or complex include paths.
- Example: Let's say you have header1.h, which includes header2.h, and both headers are included in main.c. Without #pragma once or header guards, the contents of header2.h would be processed twice within main.c, potentially leading to errors due to duplicate definitions.
- For most modern C and C++ projects, using #pragma once is a safe and efficient choice to prevent multiple header inclusions.
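- For comparison, here is a sketch of the traditional header-guard idiom (the guard macro name is just a convention chosen for this example):
```cpp!
// header.h: traditional include guard, same effect as #pragma once
#ifndef MY_PROJECT_HEADER_H
#define MY_PROJECT_HEADER_H

// declarations go here ...

#endif // MY_PROJECT_HEADER_H
```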
- Macros
- Macros are defined with #define directive.
- Simply think of them like text-based find-and-replace operations.
- Example:
```cpp
// Parenthesize macro parameters to avoid operator-precedence surprises
#define MAX(x, y) ((x) > (y) ? (x) : (y))
int highest = MAX(var_1, var_2);
// The preprocessor will replace it with
// int highest = ((var_1) > (var_2) ? (var_1) : (var_2));
```
- In C++, people use const (or constexpr) for values and inline functions for code instead of macros, since both get proper type checking.
- People still use macros because of **conditional compilation**, e.g., #ifdef, #ifndef
- #ifdef M means if the macro M is defined.
- #ifndef M means if the macro M is not defined.
- Example:
```cpp!
// Note: a multi-line macro body needs backslash line continuations.
#define DEBUG

#ifdef DEBUG
#define DEBUG_BLOCK(expr) \
    do {                  \
        expr              \
    } while (0)
#else
#define DEBUG_BLOCK(expr) \
    do {                  \
    } while (0)
#endif
```
- In this example, the macro DEBUG acts as a switch to turn debug mode on or off. This use of macros is very common and useful for debugging purposes.
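- Hypothetical usage of the macro above; the wrapped statements are compiled in only when DEBUG is defined:
```cpp!
#include <cstdio>

int main() {
    int x = 42;
    DEBUG_BLOCK(
        printf("x = %d\n", x);  // vanishes entirely when DEBUG is undefined
    );
    return 0;
}
```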
- How to use a function from a header file in another .c file?
For example, the structure look like this:
```
├── src
│ ├── main.c
│ ├── helper.c
├── include
│ ├── main.h
│ ├── helper.h
├── lib
│ └── ...
├── bin
│ └── ...
└── test
└── ...
```
where we have help() declared in helper.h and implemented in helper.c. Now if you want to call help() from main.c, follow these steps:
- ```#include "helper.h"``` inside main.c (the header already carries the declaration of help(), so you don't repeat it)
- when compiling, put the include directory on the search path and list both source files: ```gcc -Iinclude src/main.c src/helper.c -o my_program```
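- A minimal sketch of the three files involved (contents hypothetical):
```c
/* include/helper.h */
#pragma once
int help(void);

/* src/helper.c */
#include "helper.h"
int help(void) { return 42; }

/* src/main.c */
#include "helper.h"   /* found via -Iinclude */
#include <stdio.h>
int main(void) { printf("%d\n", help()); return 0; }
```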
## Build tools
### CMake vs Make
- Make
```!
- Simple Projects: For smaller projects with a straightforward build process and few dependencies, Make's simplicity and direct control can be advantageous.
- Fine-Grained Control: If you need precise control over the build steps and prefer to write custom build rules, Make offers greater flexibility.
- Learning Curve: Make has a gentler learning curve for beginners compared to CMake.
- Limited Platforms: If you are only targeting a specific platform (e.g., Linux) and don't require portability, Make can be a sufficient choice.
```
- CMake
```!
- Complex Projects: As your project grows in complexity, with multiple targets, dependencies, and libraries, CMake's ability to manage and automate the build process becomes invaluable.
- Cross-Platform Development: CMake excels at generating build files for various build systems (Makefiles, Ninja, Visual Studio projects) and platforms (Linux, Windows, macOS), making it ideal for cross-platform development.
- External Dependencies: CMake simplifies the process of finding and integrating external libraries and dependencies into your project.
- Testing and Packaging: CMake provides built-in support for testing frameworks and packaging tools, streamlining the development workflow.
- IDE Integration: CMake integrates well with various IDEs, allowing for easier project management and debugging.
```
### Basic CMake usage
1. Basic Structure of a CMake Project and the Overall Explanation
- Ensure you have a clear and organized structure for your project. A typical layout might look like this:
```
MyProject/
├── CMakeLists.txt
├── src/
│ ├── main.cpp
│ ├── module1.cpp
│ ├── module2.cpp
├── include/
│ ├── module1.h
│ ├── module2.h
├── lib/
│ ├── external_library/
│ │ ├── CMakeLists.txt
│ │ ├── src/
│ │ ├── include/
├── tests/
│ ├── CMakeLists.txt
│ ├── test_module1.cpp
│ ├── test_module2.cpp
├── CMakeModules/
│ ├── FindExternalLibrary.cmake
└── build/
```
- Explanation:
- CMakeLists.txt: This is the main CMake configuration file. It is placed at the root of your project and includes directives for building the project, setting up compiler options, and defining targets.
- src/: This directory contains the source files (.cpp or .c) for your project.
- include/: This directory contains the header files (.h or .hpp) for your project. It provides the interface for your source files.
- lib/: This directory is for external libraries that your project depends on. Each external library can have its own subdirectory with its own CMakeLists.txt, source files, and header files.
- tests/: This directory contains your test files and a CMakeLists.txt to set up the testing framework and test targets.
- CMakeModules/: This directory is optional and can contain custom CMake modules or find scripts (e.g., FindExternalLibrary.cmake) to locate and configure external dependencies.
- build/: This directory is typically not included in the source repository. It is the default build directory where you generate your build files (Makefiles, Visual Studio solutions, etc.) and compiled binaries.
- Example CMakeLists.txt (Root)
```cmake!
cmake_minimum_required(VERSION 3.10)
project(YourProject)
# Set the C++ standard
set(CMAKE_CXX_STANDARD 17)
# Add subdirectories
add_subdirectory(src)
add_subdirectory(tests)
# External libraries
add_subdirectory(lib/external_library)
# Enable testing
enable_testing()
```
- Example CMakeLists.txt (src)
```cmake!
# Create a library from the source files
add_library(your_project_lib
    module1.cpp
    module2.cpp
)

# Specify include directories for the library
target_include_directories(your_project_lib
    PUBLIC ${CMAKE_SOURCE_DIR}/include
)

# Create an executable from the main file
add_executable(your_project_exe
    main.cpp
)
# Link the library to the executable
target_link_libraries(your_project_exe your_project_lib)
```
- Example CMakeLists.txt (tests)
```cmake!
# Add a testing framework, e.g., Google Test
find_package(GTest REQUIRED)
include_directories(${GTEST_INCLUDE_DIRS})
# Create a test executable
add_executable(test_module1 test_module1.cpp)
target_link_libraries(test_module1 ${GTEST_LIBRARIES} pthread your_project_lib)
add_executable(test_module2 test_module2.cpp)
target_link_libraries(test_module2 ${GTEST_LIBRARIES} pthread your_project_lib)
# Add tests
add_test(NAME test_module1 COMMAND test_module1)
add_test(NAME test_module2 COMMAND test_module2)
```
- Example CMakeLists.txt (lib/external_library)
```cmake!
project(ExternalLibrary)
# Create the external library
add_library(external_library
    src/external_lib.cpp
)

# Specify include directories for the external library
target_include_directories(external_library
    PUBLIC ${CMAKE_CURRENT_SOURCE_DIR}/include
)
```
*Below is the full explanation for each part.*
2. Set Minimum CMake Version
Always set a minimum required version of CMake to ensure compatibility.
```cmake!
cmake_minimum_required(VERSION 3.15)
```
3. Project Declaration
Declare your project with appropriate metadata.
```cmake!
project(MyProject VERSION 1.0.0 LANGUAGES CXX)
```
4. Set C++ Standard
Specify the C++ standard your project requires.
```cmake!
set(CMAKE_CXX_STANDARD 17)
set(CMAKE_CXX_STANDARD_REQUIRED ON)
```
5. Organize Code with Subdirectories
Use add_subdirectory to include subdirectories, such as src and tests.
```cmake
add_subdirectory(src)
add_subdirectory(tests)
```
6. Use Targets for Libraries and Executables
Create targets for your libraries and executables. This helps in managing dependencies and properties. (In CMake a target can be a library or an executable)
- add_library: This command creates a library target from specified source files.
```cmake!
add_library(MyLibrary src/my_library.cpp)
# MyLibrary is the name of the library target.
# src/my_library.cpp is the source file for the library.
```
- add_executable: This command creates an executable target from specified source files.
```cmake!
add_executable(MyExecutable src/main.cpp)
# MyExecutable is the name of the executable target.
# src/main.cpp is the source file for the executable.
```
- target_include_directories: This command specifies include directories for a target. It can take PUBLIC, PRIVATE, or INTERFACE keywords to control the visibility of the include directories.
```cmake!
target_include_directories(MyLibrary PUBLIC ${CMAKE_CURRENT_SOURCE_DIR}/include)
# MyLibrary is the target to which the include directories are being added.
# PUBLIC means the include directories are used for both the target itself and for any targets that link against it.
# ${CMAKE_CURRENT_SOURCE_DIR}/include is the path to include.
# With the include/ directory on the include path, you can write #include "abc.h" instead of #include "include/abc.h"
```
- target_link_libraries: This command links libraries to a target. It can take PUBLIC, PRIVATE, or INTERFACE keywords to control the visibility of the linkage.
```cmake!
target_link_libraries(MyExecutable PRIVATE MyLibrary)
# MyExecutable is the target to which the library is being linked.
# PRIVATE means the linkage is only relevant for the target itself, not for any targets that might link against it.
# MyLibrary is the library to link against.
```
- **Note:**
- for a target, if you link it against a library whose include directories are PUBLIC, those include paths (and the headers they expose) become visible to the target as well. Similarly, all headers from the target's own include directories are visible. That's why we need #pragma once to prevent multiple inclusions.
- you still need to include the header file to use the functions from the library in your code. Linking a library to your target provides the compiled code of the library to your program, but it does not provide the declarations of the functions and classes, which are necessary for the compiler to know how to call them.
7. Use Modern CMake Commands
Prefer modern CMake commands and properties, such as target_include_directories, target_link_libraries, and target_compile_features.
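For instance, target_compile_features expresses the required language standard as a usage requirement on the target (a small sketch; MyLibrary is the target from above):
```cmake
# Consumers linking MyLibrary also get C++17, because the feature is PUBLIC
target_compile_features(MyLibrary PUBLIC cxx_std_17)
```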
8. Dependencies and External Libraries
Use find_package to locate external libraries and link them to your targets.
```cmake
find_package(MPI REQUIRED)
target_link_libraries(MyApp PRIVATE MPI::MPI_CXX)
```
9. Handling Compiler Flags
Set compiler-specific flags using target_compile_options.
```cmake
if (CMAKE_CXX_COMPILER_ID STREQUAL "GNU")
target_compile_options(MyApp PRIVATE -Wall -Wextra -pedantic)
endif()
```
10. Use Interface Libraries for Header-Only Libraries
For header-only libraries, use INTERFACE targets.
```cmake
add_library(HeaderOnlyLib INTERFACE)
target_include_directories(HeaderOnlyLib INTERFACE include)
```
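A consumer then links the interface target like any other library, and the include directory propagates through the INTERFACE usage requirements:
```cmake
target_link_libraries(MyApp PRIVATE HeaderOnlyLib)
```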
11. Testing
Integrate testing with CMake using CTest.
```cmake
enable_testing()
add_executable(TestApp tests/test_main.cpp)
target_link_libraries(TestApp PRIVATE MyLibrary)
add_test(NAME MyTest COMMAND TestApp)
```
12. Installation Rules
Define installation rules for your targets and headers.
```cmake
install(TARGETS MyApp MyLibrary
RUNTIME DESTINATION bin
LIBRARY DESTINATION lib
ARCHIVE DESTINATION lib)
install(DIRECTORY include/ DESTINATION include)
```
13. Use CMake Presets
Leverage CMake presets for easier configuration management.
Example CMakePresets.json:
```json!
{
    "version": 3,
    "cmakeMinimumRequired": {
        "major": 3,
        "minor": 19,
        "patch": 0
    },
    "configurePresets": [
        {
            "name": "default",
            "hidden": true,
            "generator": "Ninja",
            "binaryDir": "${sourceDir}/build",
            "cacheVariables": {
                "CMAKE_BUILD_TYPE": "Release"
            }
        },
        {
            "name": "ci",
            "inherits": "default",
            "description": "Configuration for CI builds",
            "cacheVariables": {
                "CMAKE_BUILD_TYPE": "Debug"
            }
        }
    ]
}
```
14. Document Your CMakeLists
Comment your CMakeLists.txt files to explain non-obvious configurations and decisions.
```cmake
# Add the main application executable
add_executable(MyApp src/main.cpp)
# Link the main application with MyLibrary
target_link_libraries(MyApp PRIVATE MyLibrary)
```
## C/Cpp
### Substitution Failure Is Not An Error (SFINAE) in C++
#### std::enable_if and std::enable_if_t
In C++, the class signature of std::enable_if is as follows.
```cpp!
template< bool B, class T = void >
struct enable_if;
```
If B is true, ```std::enable_if``` has a public member typedef type, equal to T; otherwise, there is no member typedef.
(Note that outside the struct, to access it we write ```std::enable_if<condition, T>::type```.)
```std::enable_if``` could be implemented as follows.
```cpp!
template<bool B, class T = void>
struct enable_if {};
template<class T>
struct enable_if<true, T> { typedef T type; };
```
This means that whenever the implementation tries to access ```enable_if<B,T>::type``` with B = false, the member ```type``` does not exist. In ordinary code this is a compilation error; during template argument substitution, however, it is a substitution failure, which silently removes that candidate from overload resolution (SFINAE).
Since C++14, there is an additional helper shortcut ```std::enable_if_t``` defined in the C++ standard library.
```cpp!
template< bool B, class T = void >
using enable_if_t = typename enable_if<B,T>::type;
```
#### ```typename``` keyword
Usage: ```T::type``` is a dependent name because it depends on the template parameter T. The compiler does not know if``` T::type ```is a type or a value, which leads to a compilation error.
--> The ```typename``` keyword is used to tell the compiler that a dependent name is a type. This is crucial for correct template code parsing.
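A minimal sketch (HasType is a made-up struct) showing where ```typename``` is required:
```cpp!
struct HasType { using type = int; };

template <typename T>
void f()
{
    // T::type is a dependent name; 'typename' tells the compiler it names a type
    typename T::type value{};
    (void)value;
}

int main() { f<HasType>(); }
```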
#### Two SFINAE templates by Lei Mao
These are two templates we'll use to create SFINAE templates
```cpp!
// Non-Type Template Parameters
template <int = 10>
class C
{
int v{0};
};
// Type Template Parameters
template <typename = int>
class C
{
int v{0};
};
```
```cpp!
#include <iostream>
#include <type_traits>

// C++14, Type Template Parameters
template <typename T,
          typename = std::enable_if_t<std::is_integral<T>::value, float>>
void foo()
{
    std::cout << "T could only be int" << std::endl;
}

// C++11 spelling of the same idea (note the extra typename ... ::type)
template <typename T,
          typename = typename std::enable_if<std::is_integral<T>::value, float>::type>
void foo2()
{
    std::cout << "T could only be int" << std::endl;
}

// C++14, Non-Type Template Parameters
template <typename T,
          std::enable_if_t<std::is_integral<T>::value, bool> = true>
void bar()
{
    std::cout << "T could only be int" << std::endl;
}

// C++11 spelling of the same idea
template <typename T,
          typename std::enable_if<std::is_integral<T>::value, bool>::type = true>
void bar2()
{
    std::cout << "T could only be int" << std::endl;
}

int main()
{
    foo<int>();
    // foo<float>(); // Compilation error.
    bar<int>();
    // bar<float>(); // Compilation error.
}
```
### Static Allocation vs. Dynamic Allocation
- Static Allocation
- Compile-Time Determined: Size and memory allocation are determined at compile time.
- Memory Location: Typically allocated on the stack.
- Fixed Size: Cannot change size during program execution.
- Example: int array[10]; // 10 is known at compile time
- Dynamic Allocation
- Run-Time Determined: Size and memory allocation are determined at runtime.
- Memory Location: Allocated on the heap.
- Flexible Size: Can allocate, resize, and deallocate memory during program execution.
- Example: int* array = new int[size]; // size is known at runtime
- VLA: Sometimes you see something like this:
```cpp!
#include <iostream>

int main() {
    int size;
    std::cin >> size;  // size is known only at runtime
    int array[size];   // variable-length array (VLA)
    return 0;
}
```
This compiles with GCC/Clang because of VLAs, a compiler extension in C++ (VLAs are standard only in C99); they are not part of the C++ standard. (With ```const int size = 5;``` the array would be plain standard C++, since the size is then a compile-time constant.)
### `constexpr`:
- Applies to both objects and functions.
- Similar to const, but the value is known at compile time.
- A vital thing is that constexpr functions have two faces:
- When called with constexpr arguments, they produce a constexpr result.
- Otherwise they act as normal functions.
- A constexpr size is always valid for static allocation; a const size is valid only if it is itself initialized with a compile-time constant.
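- A minimal sketch of the two faces (square is a hypothetical function):
```cpp!
constexpr int square(int x) { return x * x; }

int main()
{
    constexpr int a = square(5); // constexpr argument --> compile-time result
    int arr[square(3)];          // ok as an array bound: evaluated at compile time
    int n = 4;
    int b = square(n);           // runtime argument --> acts as a normal function
    (void)arr; (void)a; (void)b;
}
```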
### `static`
1. `static` in functions:
- Keeps variable value between function calls
- Variable is initialized only once
- Lifetime: entire program run
- Scope: limited to function
2. `static` in classes/structs:
- Shared by all instances of the class
- Only one copy exists, regardless of object count
- Can be accessed without creating an object
- Must be defined outside the class declaration
In both cases, `static` elements are associated with the class or function itself, not with specific instances or calls.
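A minimal sketch of both uses (names are made up):
```cpp!
#include <iostream>

int counter()
{
    static int calls = 0; // initialized once; the value persists between calls
    return ++calls;
}

struct Widget
{
    static int count; // one copy shared by all Widget instances
};
int Widget::count = 0; // must be defined outside the class declaration

int main()
{
    std::cout << counter() << std::endl; // 1
    std::cout << counter() << std::endl; // 2
    Widget::count = 5;                   // accessed without creating an object
    std::cout << Widget::count << std::endl; // 5
}
```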
### Definition vs Declaration
- **Declaration**: Introduces a name and type without allocating storage. It indicates the existence of a variable or function.
- Example: `extern int x;` (for variables)
- Example: `int add(int a, int b);` (for functions)
- **Definition**: Allocates storage for a variable or provides the body of a function.
- Example: `int x = 10;` (for variables)
- Example: `int add(int a, int b) { return a + b; }` (for functions)
#### `extern` Keyword
- **Purpose**: Declares a variable or function defined in another file.
- **Usage**:
- Variables: `extern int count;` (declares `count` defined elsewhere)
- Functions: Typically implicit, no need for `extern`.
#### Key Points
- Declarations specify type and indicate existence.
- Definitions allocate storage or provide implementation.
- `extern` links to external definitions.
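A two-file sketch (file names hypothetical):
```cpp
// globals.cpp
int count = 0;     // definition: storage allocated here

// main.cpp
extern int count;  // declaration: refers to the definition in globals.cpp
int main() { return count; }
```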
### C++11
- C++ ```auto``` Keyword: The ```auto``` keyword allows C++11 to automatically deduce the type of the variable from its initializer.
```cpp!
auto x = 42; // x is automatically deduced to be of type int
// auto should be used for iterators and long type names.
// auto shouldn't be used when we need clarity.
```
- Range-based for loop: Provides a simpler syntax to iterate over containers.
```cpp!
std::vector<int> vec = {1, 2, 3, 4, 5};
for (auto& v : vec) {
std::cout << v << std::endl;
}
```
- ```auto``` also allows automatically deducing function return types (return type deduction for normal functions arrived in C++14).
```cpp!
auto add(int a, int b) {
return a + b; // the return type is deduced to be int
}
```
- C++ ```decltype```: The ```decltype``` keyword is used to query the type of an expression without actually evaluating it. This can be useful for template metaprogramming, auto return types, and more.
```cpp!
int x = 5;
decltype(x) y = 10; // y is deduced to be int
```
- Combining with auto: ```decltype(auto)``` deduces like decltype, preserving references and cv-qualifiers that plain ```auto``` drops.
```cpp!
int x = 5;
int& r = x;
auto a = r; // a is int (reference is not preserved)
decltype(auto) b = r; // b is int& (reference is preserved)
```
- the trailing syntax to define a function:
- the trailing return type syntax in C++ is a way to specify the return type of a function after the parameter list, using the -> operator followed by the type.
```cpp!
auto functionName(parameterList) -> returnType {
// function body
}
```
- combine with the ```decltype``` keyword: used with decltype to deduce the return type from expressions
```cpp!
template<typename T1, typename T2>
auto add(T1 a, T2 b) -> decltype(a + b) {
return a + b;
}
```
- Smart Pointers
- std::unique_ptr
- Single Ownership: A std::unique_ptr owns the object it points to exclusively. No other std::unique_ptr can point to the same object.
- Use Case: Suitable for cases where an object has a clear single ownership, and you want to ensure it is automatically destroyed when it goes out of scope.
- std::shared_ptr
- Shared Ownership: A std::shared_ptr can share ownership of an object with multiple std::shared_ptrs. The object is destroyed when the last std::shared_ptr owning it is destroyed.
- Use Case: Suitable when multiple parts of the program need to share ownership of an object.
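- A minimal sketch of both ownership models:
```cpp!
#include <memory>

int main()
{
    auto u = std::make_unique<int>(42); // sole owner; freed when u goes out of scope
    auto s1 = std::make_shared<int>(7); // shared ownership
    auto s2 = s1;                       // reference count is now 2
    // the int held by s1/s2 is destroyed when the last shared_ptr goes away
}
```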
- nullptr: Introduces nullptr as a type-safe null pointer, replacing the old NULL macro.
```cpp!
int* p = nullptr;
```
- lambda expression: A lambda expression in C++ is an anonymous function defined using the [] syntax. It can capture variables from the surrounding scope and can be used wherever a function object is needed.
- basic use case:
```cpp!
auto add = [](int a, int b) { return a + b; };
std::cout << add(2, 3); // Output: 5
```
- capturing variables by value or by reference:
- [x]: Capture x by value.
- [&x]: Capture x by reference.
- [=]: Capture all variables by value.
- [&]: Capture all variables by reference.
- [=, &y]: Capture all variables by value, but capture y by reference.
- [&, x]: Capture all variables by reference, but capture x by value.
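- A small sketch exercising a few capture forms:
```cpp!
#include <iostream>

int main()
{
    int x = 1, y = 2;
    auto by_value = [x]() { return x; };          // copies x
    auto by_ref   = [&y]() { y += 10; };          // refers to the original y
    auto mixed    = [=, &y]() { return x + y; };  // x by value, y by reference
    by_ref();
    std::cout << by_value() << " " << mixed() << std::endl; // prints: 1 13
}
```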
### misc
- include "" vs. include <>: It is kinda a convention where include <> is for standard library while include "" is for user-defined or local headers.
- size_t: an unsigned integer type suitable for representing the size of any non-negative quantity, which includes dimensions and indices.
- `__FILE__` and `__LINE__`: the `__FILE__` macro expands to the name of the source file in which it appears; similarly, the `__LINE__` macro expands to the line number at which it appears.
- const T* vs. T* const:
- const T*: pointer to a const T (the pointee cannot be modified through the pointer)
- T* const: a const pointer to a T (the pointer cannot be reseated)
- const char* const: a const pointer to a const string
- Dereferencing a pointer:
- Dereferencing on the right side of =: you are reading the value from the memory location.
- Dereferencing on the left side of =: you are writing a value to the memory location.
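- A small sketch contrasting the two const forms, plus both sides of a dereference:
```cpp!
int main()
{
    int a = 1, b = 2;
    const int* p1 = &a; // pointee is const: *p1 = 5 would not compile
    p1 = &b;            // but the pointer itself can be reseated
    int* const p2 = &a; // pointer is const: p2 = &b would not compile
    *p2 = 5;            // dereference on the left of =: write to memory
    int c = *p2;        // dereference on the right of =: read from memory
    (void)c; (void)p1;
}
```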
- the ```typename``` keyword or ```class``` keyword in template definition:
- When defining a function, traditionally you must specify the exact types of the parameters. This is limiting if you want your function to work with multiple data types --> use template parameters, introduced with ```typename``` or ```class```, for example:
```cpp!
template <class ProblemShape, class CtaTiler,
class TA, class AStride, class ASmemLayout, class AThreadLayout,
class TB, class BStride, class BSmemLayout, class BThreadLayout,
class TC, class CStride, class CSmemLayout, class CThreadLayout,
class Alpha, class Beta>
__global__ static void gemm_device(ProblemShape shape_MNK, CtaTiler cta_tiler,
TA const* A, AStride dA, ASmemLayout sA_layout, AThreadLayout tA,
TB const* B, BStride dB, BSmemLayout sB_layout, BThreadLayout tB,
TC* C, CStride dC, CSmemLayout sC_layout, CThreadLayout tC,
Alpha alpha, Beta beta)
{
using namespace cute;
int M = shape_MNK.M;
int N = shape_MNK.N;
int K = shape_MNK.K;
// Implementation using M, N, K dimensions and alpha, beta scalars
}
// The template parameters like Alpha and Beta allow you to pass in any type for these scalar values.
// Additionally, the Alpha, Beta will be deduced from alpha and beta if not specified in the gemm_device call.
```
### Template instantiation:
There are 2 ways to provide the implementation of a template:
- Include the implementation in the header, right after the declaration.
- If you only have the declaration of the template in the header file and the implementation in another file, you need Explicit Instantiation.
- Template Definition
```cpp!
template<typename T>
void printType(T param) {
std::cout << "General type\n";
}
// Explicit Specialization
template<>
void printType<int>(int param) {
std::cout << "Specialized for int\n";
}
// Explicit Instantiation
template void printType<double>(double param);
int main() {
printType(42); // Calls the specialized version for int
printType(3.14); // Calls the general version for double
printType("Hello"); // Calls the general version for const char*
return 0;
}
// Takeaways
// Template Definition provides a general implementation.
// Explicit Specialization overrides this for specific types like int.
// Explicit Instantiation ensures the compiler generates code for specific types, like double, even if not used directly in the program.
```
#### Redeclaration bug of Shared Memory:
1. Template Instantiation:
- Creates a concrete version of a template for a specific type (code generation for that type)
- Example: `template void func<int>(int);`
2. Template Specialization:
- Custom implementation for specific types
- Example:
```cpp
template <>
void func<double>(double) { /* specialized code */ }
```
3. Redeclaration of shared memory:
- Issue: Multiple `extern __shared__` declarations in the same compilation unit.
- Caused by multiple template instantiations in one file.
4. Solutions:
- Separate instantiations: Put each instantiation in different .cu files
- Use specialization: Provide specialized versions for each type
Example of separation:
```cpp
// file1.cu
template void func<int>(int);
// file2.cu
template void func<float>(float);
```
Example of specialization:
```cpp
template <>
void func<int>(int) { /* specialized for int */ }
```
Key takeaway: Manage template instantiations carefully to avoid shared memory redeclaration issues.
## CUDA
### malloc vs. cudaMallocHost
1. Memory Location:
- malloc: Allocates pageable memory on the host.
- cudaMallocHost: Allocates pinned memory on the host.
2. Performance:
- malloc: Slower for data transfers between host and GPU because the memory can be paged.
- cudaMallocHost: Faster for data transfers between host and GPU due to pinned memory which is not pageable.
3. Use Case:
- malloc: General-purpose memory allocation for host applications.
- cudaMallocHost: Optimized memory allocation for CUDA applications to improve data transfer performance between host and GPU.
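A minimal sketch of a pinned allocation feeding a host-to-device transfer (size arbitrary):
```cpp!
#include <cuda_runtime.h>

int main()
{
    const int n = 1 << 20;
    float *h_pinned = nullptr, *d_data = nullptr;
    cudaMallocHost((void**)&h_pinned, n * sizeof(float)); // pinned host memory
    cudaMalloc((void**)&d_data, n * sizeof(float));
    // Transfers from pinned memory can be DMA'd directly, no staging copy
    cudaMemcpy(d_data, h_pinned, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaFree(d_data);
    cudaFreeHost(h_pinned);
}
```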
### Illegal things inside a conditional statement
- shared memory allocation
- ```__syncthreads()``` (if not every thread of the block reaches the call, the block can deadlock)
```cpp!
__global__ void myKernel(int condition) {
if (condition) {
__shared__ float sharedArray[256]; // This is illegal
// Use shared memory
sharedArray[threadIdx.x] = threadIdx.x;
} else {
// Do something else
}
}
```
### Synchronization
- cudaDeviceSynchronize:
- Blocks the Host: When you call cudaDeviceSynchronize from your host code, the CPU will halt execution until all previously issued CUDA operations on the device have finished.
- Acts on All Streams: This function doesn't just synchronize a specific stream or kernel. It ensures that all pending work across all streams on the device has been completed.
```cpp!
// Launch a CUDA kernel
myKernel<<<blocks, threads>>>(...);
// Synchronize to ensure kernel execution has completed
cudaDeviceSynchronize();
// Copy results back to host
CUDA_CHECK(cudaMemcpy(host_data, device_data, size, cudaMemcpyDeviceToHost));
// Continue with host processing using the correct data from the device
```
### Timing
- cudaEventCreate: Creates an event object, which is essentially a marker in the GPU's timeline.
```cpp!
cudaEvent_t start, stop;
CUDA_CHECK(cudaEventCreate(&start));
CUDA_CHECK(cudaEventCreate(&stop));
```
- cudaEventRecord: Records an event at a specific point in the execution stream. This is like placing a timestamp on the GPU timeline.
```cpp!
cudaEventRecord(start, 0); // Record the "start" event
// ... CUDA kernel launch and other GPU operations
cudaEventRecord(stop, 0); // Record the "stop" event
```
- cudaEventSynchronize: Waits for a specific event to complete. This is crucial for accurate timing measurements.
```cpp!
cudaEventSynchronize(stop); // Wait until the "stop" event is reached
```
- cudaEventElapsedTime: Use this function to get the time difference between two recorded events.
```cpp!
float milliseconds = 0;
CUDA_CHECK(cudaEventElapsedTime(&milliseconds, start, stop));
printf("Time for kernel execution: %.3f ms\n", milliseconds);
```
### Static Shared Memory vs. Dynamic Shared Memory
- Static Shared Memory:
- Static shared memory is **declared** and **defined** at compile time within the kernel using the __shared__ keyword. This memory is allocated and managed by the CUDA runtime based on the declaration size.
- In this case, the size of the shared memory is fixed and determined at compile time. You do not need to specify the third parameter in the kernel launch, as the CUDA runtime automatically handles it.
```cpp!
__global__ void myKernel(...) {
__shared__ float staticSharedMem[256]; // Static shared memory
// Kernel code
}
```
- Dynamic Shared Memory
- Dynamic shared memory is allocated at runtime and its size is determined when launching the kernel. It is **declared** with an **extern** keyword within the kernel. (It's not defined here, the definition is at the kernel launch)
```cpp!
__global__ void myKernel(...) {
extern __shared__ float dynamicSharedMem[]; // Dynamic shared memory
// Kernel code
}
```
- The 3rd parameter of the kernel launch is the size of the dynamic shared memory.
```cpp!
const int sram_size = (3 * Bc * head_dim * sizeof(float)) + (Bc * Br * sizeof(float));
flash_attention_1_kernel<<<grid, block, sram_size>>>(d_Q, d_K, d_V, d_O,
d_l, d_m,
Tr, Tc, Br, Bc,
num_heads, seq_len, head_dim,
softmax_scale
);
// here it is the sram_size
```
- **warning**: If you use dynamic shared memory without specifying the third parameter (shared memory size) in the kernel launch, the kernel will not allocate any shared memory dynamically. This can lead to undefined behavior or access violations if the kernel tries to access the unallocated dynamic shared memory.
### pragma unroll
- ```#pragma unroll``` is a compiler directive used in C, C++, and CUDA programming to hint to the compiler that it should unroll loops, which can optimize performance by reducing the overhead of the loop-control code.
- ```#pragma unroll``` is a compiler optimization that can, for example, replace a piece of code like
```cpp!
for ( int i = 0; i < 5; i++ )
b[i] = i;
```
with
```cpp!
b[0] = 0;
b[1] = 1;
b[2] = 2;
b[3] = 3;
b[4] = 4;
```
- Practice: put it on the ```for``` loops you care about, and tune it as a post-optimization tool.
- Warning: For ```#pragma unroll``` to unroll a loop effectively, the number of iterations must be known at compile time. If the loop count is only known at runtime, the compiler cannot perform this optimization.
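A small CUDA sketch where unrolling applies (the kernel is made up; the trip count of 4 is known at compile time):
```cpp!
__global__ void scale4(float* x, float s)
{
    const int base = threadIdx.x * 4;
    #pragma unroll
    for (int i = 0; i < 4; ++i) // fixed trip count: the compiler can fully unroll
    {
        x[base + i] *= s;
    }
}
```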
### Global Memory Coalescing
- How are blocks, threads organized?
- **The organization of blocks inside a grid is the same with the organization of threads inside a block:**
```!
// blockId
blockId
= blockIdx.x + gridDim.x * (blockIdx.y + gridDim.y*blockIdx.z)
= blockIdx.x + gridDim.x*blockIdx.y + (gridDim.x*gridDim.y) * blockIdx.z
// threadId
threadId
= threadIdx.x + blockDim.x * (threadIdx.y + blockDim.y*threadIdx.z)
= threadIdx.x + blockDim.x*threadIdx.y + (blockDim.x*blockDim.y) * threadIdx.z
// folding into 1D globalThreadId
globalThreadId = blockId * (blockDim.x * blockDim.y * blockDim.z) + threadId
```
[CUDA Thread Indexing Cheatsheet](https://www.eecs.umich.edu/courses/eecs498-APP/resources/materials/CUDA-Thread-Indexing-Cheatsheet.pdf)
- Which threads are called consecutive? Those that are consecutive in globalThreadId:
- two are in the same block and consecutive in threadId
- one is the last thread of the (n-1)th block and one is the first thread of the (n)th block

- Memory coalescing inside a warp: in the first illustration, 1 warp has 8 threads, and each thread loads a 32-bit (4 B) value, so loading 8 values requires 8 x 4 B = 32 B, i.e., a single 32 B load instruction. This is the ideal case, where one 32 B transaction is enough because the accesses are coalesced. If they are not coalesced, more than one 32 B transaction may be needed; in the second illustration, two are required.
(Please ignore the caption under the visualization, which uses a 32 B instruction for 4 consecutive accesses instead of 8.)


- Note that accesses need not be strictly adjacent to enable memory coalescing. As long as the threads access consecutive data, in any order, the accesses can be merged:

- Practically, GPUs support 32 B, 64 B, and 128 B transactions. Therefore, a warp with 32 threads ideally needs only one 32 x 4 B = 128 B transaction to do the load. Otherwise, it issues as many 32 B loads as needed to fetch all the requested data.
- Within-warp broadcast: when threads within the same warp read the same address, the value is fetched once and broadcast to all of them, which speeds up the access.
### ```__restrict__``` keyword
```__restrict__``` before a pointer (as a parameter of a function) tells the compiler that this pointer is the only one that accesses the memory it points to within the scope of the function, i.e., the memory regions pointed to by the restrict-qualified pointers do not overlap. --> Given this hint, the compiler can optimize loads and stores from these regions more aggressively.
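A hypothetical kernel sketch: with both pointers restrict-qualified, the compiler knows writes through out cannot alias reads through in:
```cpp!
__global__ void add_one(const float* __restrict__ in,
                        float* __restrict__ out, int n)
{
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
    {
        out[i] = in[i] + 1.0f; // in[i] can safely be kept in a register
    }
}
```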
### Arithmetic Intensity
Defined as the number of FLOPs executed per byte transferred (load + store!) between GMEM and SMEM.
A good example of this concept is GEMM with no thread tiling vs. 1D thread-tiling GEMM vs. 2D thread-tiling GEMM.


So the rule here is to increase the arithmetic intensity as long as your kernel is still memory-bound (i.e., it spends most of its time transferring data). The intuition: if the new kernel spends the same time transferring, it can now compute more results in that time.
### Vectorized Memory Access
https://developer.nvidia.com/blog/cuda-pro-tip-increase-performance-with-vectorized-memory-access/
https://github.com/HuyNguyen-hust/vma-101
VMA increases effective bandwidth while decreasing the number of executed instructions.
The magic lies in ```reinterpret_cast```.
The ultimate goal is to fuse several load/store instructions into one instruction (or memory transaction).
To do that, it casts a pointer of a smaller type (int*, float*, ...) to a pointer of a larger type (int2*, int4*, ...).
For example:
```cpp!
reinterpret_cast<int2*>(d_out)[i] = reinterpret_cast<int2*>(d_in)[i];
// d_out and d_in are int pointers
```
This treats d_out and d_in as int2 pointers, so loads and stores happen in an 8-byte-wide fashion (equal to 2 int values). So instead of 2 loads and 2 stores, only one of each is needed.
**Warning:** This technique requires the alignment of the data. Device-allocated memory is automatically aligned to a multiple of the size of the data type, but if you offset the pointer the offset must also be aligned. For example, ```reinterpret_cast<int2*>(d_in+1)``` is invalid because ```d_in+1``` is not aligned to a multiple of ```sizeof(int2)```.
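Putting it together, a hypothetical copy kernel using int2 (assumes n is even and both pointers come straight from cudaMalloc, hence suitably aligned):
```cpp!
__global__ void copy_int2(int* d_out, const int* d_in, int n)
{
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n / 2)
    {
        // one 8-byte load and one 8-byte store move two ints per thread
        reinterpret_cast<int2*>(d_out)[i] =
            reinterpret_cast<const int2*>(d_in)[i];
    }
}
```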
### WMMA API: Warp Matrix Multiplication Accumulate
To use this api, do ```#include <mma.h>```
Check this example for understanding: https://github.com/HuyNguyen-hust/gemm-101/blob/main/src/07_2d_block_tiling_2d_warp_tiling_transposed_wmma.cu
Note that the matrix A in WMMA is of size MxK, so if your A tile is transposed, you need to declare the A fragment as col_major so WMMA reads the fragment correctly.
```cpp!
nvcuda::wmma::fragment<nvcuda::wmma::matrix_a,
WMMA_TILE_SIZE_M, WMMA_TILE_SIZE_N, WMMA_TILE_SIZE_K, T,
nvcuda::wmma::col_major> A_fragments[NUM_WMMA_TILES_M];
```
## Common C/Cpp libraries
- stdio.h: The stdio.h header file is essential for input and output operations in C. It contains functions like printf(), scanf(), fopen(), and fclose() that are used for printing to the screen, reading from the keyboard, and handling file operations.
- iostream: the stdio.h of cpp
- stdlib.h: On the other hand, stdlib.h is crucial for memory management and general-purpose functions in C. It includes functions like malloc(), free(), abort(), and exit() that are used for dynamic memory allocation, freeing memory, program termination, and conversions between data types.
- cmath: This header declares a set of functions to compute common mathematical operations and transformations, such as trigonometric functions, logarithms, exponentials, and more. It provides a wide range of mathematical functions that are useful for scientific and engineering applications.
- cassert: This header provides a macro called assert() for documenting C++ program assertions and a mechanism for disabling the assertion checks through defining the macro NDEBUG. The assert() macro is used for debugging purposes to check if a certain condition is true at a specific point in the code. If the condition is false, it prints a diagnostic message and calls the abort() function.
## Common Cuda libraries
- **note**: Any header file (.hpp, .cuh, .h) that includes CUDA libraries must be included in at least one CUDA (.cu) file so that the nvcc compiler processes it. If such a header is only included in .cpp files (compiled by the host compiler), this will lead to errors.
- cuda_runtime.h: one of the most commonly included header files in CUDA applications.
- Memory Management: Functions to allocate, deallocate, and copy memory between host (CPU) and device (GPU).
- Kernel Launching: Macros and functions to define and launch kernels, which are functions that run on the GPU.
- Device Management: Functions to query and set device properties, manage device initialization and reset.
- Error Handling: Functions to query and handle errors related to CUDA operations.
- Stream and Event Management: Functions to create and manage streams and events, which are used to control the order of execution of operations on the GPU.
- Math Functions: Provides GPU-accelerated mathematical functions.
- cuda_fp16.h: a header file provided by NVIDIA CUDA that facilitates the use of half-precision floating-point arithmetic (16-bit floats) in CUDA programs.
- Data Type Definitions: It defines the __half and __half2 data types, where __half represents a single 16-bit floating-point value and __half2 represents two packed 16-bit floating-point values.
- Arithmetic Operations: It provides functions and intrinsics for performing arithmetic operations on __half and __half2 types, including addition, subtraction, multiplication, division, and more.
- Conversion Functions: It includes functions for converting between __half and other floating-point types such as float and double.
- Vectorized Operations: The __half2 type allows for SIMD (Single Instruction, Multiple Data) operations, enabling parallel processing of two half-precision values simultaneously.