# cuBLAS tutorial ## Version * [C](http://docs.nvidia.com/cuda/cublas/index.html#axzz4s6NNQuOu) * [Python](https://docs.continuum.io/accelerate/install) ## Introduction of functions ### Fundamental functions * ```cublasCreate()```: initialize the handle to the cuBLAS library context. * ```cublasDestory() ```: release the resources associated with the cuBLAS library context. * ```cublasSetPointerMode()``` : if a function returns a scalar, we can make it be a host pointer or device pointer. ### Error status * cublasStatus_t : error status ### Note * cublas can be used at different device, should be called after cudaSetDevice. ### Compile * static : ```nvcc myCublasApp.c -lcublas_static -lculibos -o myCublasApp``` * dynamic : ```nvcc myCublasApp.c -lcublas -o myCublasApp``` --- #### 動態靜態函式庫 * [linux version](http://blog.xuite.net/ian11832/blogg/33493241-Linux%E5%8B%95%E6%85%8B%26%E9%9D%9C%E6%85%8B%E5%87%BD%E5%BC%8F%E5%BA%AB%E8%A7%A3%E8%AA%AA) --- ## Some important BLAS functions in cuBLAS ### Vector in cuBLAS #### cublasSetVector() ```C= cublasStatus_t cublasSetVector(int n, int elemSize, const void *x, int incx, void *y, int incy) ``` * Rmk : * x : in host * y : in GPU * y can be vector/part of matrix #### cublasGetVector() ```C= cublasStatus_t cublasGetVector(int n, int elemSize, const void *x, int incx, void *y, int incy) ``` * Rmk : * x : in GPU * y : in host * x can be vector/part of matrix * Notation * n : get how many elements form host * elemSize : elements size(byte) * incx, incy : [see reference](https://devtalk.nvidia.com/default/topic/378993/function-cublassetvector/?offset=2)![](https://i.imgur.com/JU8SuOA.png) ### Matrix in cuBLAS #### cublasSetMatrix() ```C= cublasStatus_t cublasSetMatrix(int rows, int cols, int elemSize, const void *A, int lda, void *B, int ldb) ``` * Rmk : * A : host * B : GPU #### cublasGetMatrix() ```C= cublasStatus_t cublasGetMatrix(int rows, int cols, int elemSize, const void *A, int lda, void *B, int ldb) ``` * Rmk : * A : GPU * B : host * Notation: * The leading dimension indicates the number of **rows** of the allocated matrix The leading dimension always refers to the length of the first dimension of the array --- ## BLAS1 vector-vector > We will use abbreviations Using the cuBLAS API <type> for type and <t> for the corresponding short type >Most of the matries are 1-based ### Table | type | t | Meaning | | -----|-----|---------| |float |‘s’ or ‘S’| real single-precision| |double| ‘d’ or ‘D’| real double-precision| |cuComplex| ‘c’ or ‘C’| complex single-precision| |cuDoubleComplex| ‘z’ or ‘Z’| complex double-precision| ### cublasI\<t>amax() ```C= cublasStatus_t cublasIsamax(cublasHandle_t handle, int n, const float *x, int incx, int *result) cublasStatus_t cublasIdamax(cublasHandle_t handle, int n, const double *x, int incx, int *result) cublasStatus_t cublasIcamin(cublasHandle_t handle, int n, const cuComplex *x, int incx, int *result) cublasStatus_t cublasIzamin(cublasHandle_t handle, int n, const cuDoubleComplex *x, int incx, int *result) ``` * Return : index of max/min in the given array * incx : as above ### cublas\<t>axpy() ``` C= cublasStatus_t cublasSaxpy(cublasHandle_t handle, int n, const float *alpha, const float *x, int incx, float *y, int incy) ``` * Return : $y$ where this function calculates $\alpha x+y$ * Rmk : x,y can be part of matrix, when dealing with this case, we would change our incx\incy parameters as above. ### cublas\<t>copy() ```C= cublasStatus_t cublasScopy(cublasHandle_t handle, int n, const float *x, int incx, float *y, int incy) ``` * Return : y where this function copies the vector x into the vector y * Rmk : x,y can be part of matrix, when dealing with this case, we would change our incx\incy parameters as above. ### cublas\<t>dot() ```C= cublasStatus_t cublasSdot (cublasHandle_t handle, int n, const float *x, int incx, const float *y, int incy, float *result) ``` * Rmk : Here we can call the function, ```cublasSetPointerMode``` to modify the result into specific pointer. ### cublas\<t>nrm2() ```C= cublasStatus_t cublasSnrm2 (cublasHandle_t handle, int n, const float *x, int incx, float *result) ``` ### cublas\<t>rot() ```C= cublasStatus_t cublasSrot(cublasHandle_t handle, int n, float *x, int incx, float *y, int incy, const float *c, const float *s) ``` * Rmk : where s is the entry of the rotation matrix $G = \begin{pmatrix} c & s \\ -s & c \end{pmatrix}$ , s is the sin value, c is the cos value. ### cublas\<t>scal() ```C= cublasStatus_t cublasSrot(cublasHandle_t handle, int n, const float *alpha, float *x, int incx) ``` * Rmk : return $x = \alpha \cdot x$ ## BLAS2 matrix-vector ### cublas\<t>gemv() ```C= cublasStatus_t cublasSgemv(cublasHandle_t handle, cublasOperation_t trans, int m, int n, const float *alpha, const float *A, int lda, const float *x, int incx, const float *beta, float *y, int incy) ``` * Return : $\alpha \cdot op(A) \cdot x+ \beta \cdot y$ * Rmk : cublasOperation_t = CUBLAS_OP_T/CUBLAS_OP_N/CUBLAS_OP_H ### cublas\<t>ger() ```C= cublasStatus_t cublasSger(cublasHandle_t handle, int m, int n, const float *alpha, const float *x, int incx, const float *y, int incy, float *A, int lda) ``` * Return : $\alpha xy^{T or H} + A$ (T or H is related to which function you call) ------------------------------- ### Particular functions > Following functions are some particular format wich s represent symmetric and t represent triangular. ------------------------------- #### cublas\<t>spmv() ```C= cublasStatus_t cublasSsbmv(cublasHandle_t handle, cublasFillMode_t uplo, int n, int k, const float *alpha, const float *A, int lda, const float *x, int incx, const float *beta, float *y, int incy) ``` * Return : $\alpha Ax+\beta y$ * Rmk : * A is a symmetric matrix stored in packed format * packed format ![](https://i.imgur.com/MHsnimd.png) * some rmk in toolkit ![](https://i.imgur.com/B240pyF.png) #### cublas\<t>tbmv() ```C= cublasStatus_t cublasStbmv(cublasHandle_t handle, cublasFillMode_t uplo, cublasOperation_t trans, cublasDiagType_t diag, int n, int k, const float *A, int lda, float *x, int incx) ``` * Return : $x = Ax$ * Rmk : triangular banded matrix-vector multiplication * Rmk in toolkit ![](https://i.imgur.com/36t9c2y.png) #### cublas\<t>tbsv() ```C= cublasStatus_t cublasStbsv(cublasHandle_t handle, cublasFillMode_t uplo, cublasOperation_t trans, cublasDiagType_t diag, int n, int k, const float *A, int lda, float *x, int incx) ``` * Solving $op(A)x=b$ * Notation : * cublasDiagType_t ==CUBLAS_DIAG_NON_UNIT or CUBLAS_DIAG_UNIT * A is triangular --- ## BLAS3 matrix-matrix ### cublas\<t>gemm() ```C= cublasStatus_t cublasSgemm(cublasHandle_t handle, cublasOperation_t transa, cublasOperation_t transb, int m, int n, int k, const float *alpha, const float *A, int lda, const float *B, int ldb, const float *beta, float *C, int ldc) ``` * Return : $\alpha\cdot op(A)op(B) + \beta C$ ### cublas\<t>geam() ```C= cublasStatus_t cublasSgeam(cublasHandle_t handle, cublasOperation_t transa, cublasOperation_t transb, int m, int n, const float *alpha, const float *A, int lda, const float *bet, const float *B, int ldb, float *C, int ldc) ``` * Return : $C=\alpha A+\beta B$ * Rmk : [在地演算法](https://zh.wikipedia.org/wiki/%E5%8E%9F%E5%9C%B0%E7%AE%97%E6%B3%95)只在A或B=C時能使用 ### cublas\<t>dgmm() ```C= cublasStatust cublasSdgmm(cublasHandle_t handle, cublasSideMode_t mode, int m, int n, const float *A, int lda, const float *x, int incx, float *C, int ldc) ``` * Return : $A*diag(X)$ respect to mode --- ## Solveing Ax=b by LU decomposition * calling cublas\<t>getrfBatched() → getting the LU decomposition from which it return. * cublas\<t>getriBatched() → getting the inverse of L and U. ### cublas\<t>getrfBatched() ```C= cublasStatus_t cublasSgetrfBatched(cublasHandle_t handle, int n, float *Aarray[], int lda, int *PivotArray, int *infoArray, int batchSize); ``` * Note : * ```*Aarray[]``` : an array that stores the matrixes that going to be decomposed. * ```lda``` : leading dimension of matrixes that stored in Aarray[] * ```PivotArray``` : this array[i] stored the row that would be exchanged form identity matrix. ![](https://i.imgur.com/eGqL0A2.jpg) * ```infoArray``` : if return != 0 then the process is failed. ### cublas\<t>getriBatched() ```C= cublasStatus_t cublasSgetriBatched(cublasHandle_t handle, int n, float *Aarray[], int lda, int *PivotArray, float *Carray[], int ldc, int *infoArray, int batchSize); ``` * Notation : * ```infoArray``` : if return != 0 then the process is failed. * return C. [reference](https://stackoverflow.com/questions/28794010/solving-linear-systems-ax-b-with-cuda)