# cuBLAS tutorial
## Version
* [C](http://docs.nvidia.com/cuda/cublas/index.html#axzz4s6NNQuOu)
* [Python](https://docs.continuum.io/accelerate/install)
## Introduction of functions
### Fundamental functions
* ```cublasCreate()```: initialize the handle to the cuBLAS library context.
* ```cublasDestory() ```: release the resources associated with the cuBLAS library context.
* ```cublasSetPointerMode()``` : if a function returns a scalar, we can make it be a host pointer or device pointer.
### Error status
* cublasStatus_t : error status
### Note
* cublas can be used at different device, should be called after cudaSetDevice.
### Compile
* static :
```nvcc myCublasApp.c -lcublas_static -lculibos -o myCublasApp```
* dynamic :
```nvcc myCublasApp.c -lcublas -o myCublasApp```
---
#### 動態靜態函式庫
* [linux version](http://blog.xuite.net/ian11832/blogg/33493241-Linux%E5%8B%95%E6%85%8B%26%E9%9D%9C%E6%85%8B%E5%87%BD%E5%BC%8F%E5%BA%AB%E8%A7%A3%E8%AA%AA)
---
## Some important BLAS functions in cuBLAS
### Vector in cuBLAS
#### cublasSetVector()
```C=
cublasStatus_t
cublasSetVector(int n, int elemSize,
const void *x, int incx, void *y, int incy)
```
* Rmk :
* x : in host
* y : in GPU
* y can be vector/part of matrix
#### cublasGetVector()
```C=
cublasStatus_t
cublasGetVector(int n, int elemSize,
const void *x, int incx, void *y, int incy)
```
* Rmk :
* x : in GPU
* y : in host
* x can be vector/part of matrix
* Notation
* n : get how many elements form host
* elemSize : elements size(byte)
* incx, incy : [see reference](https://devtalk.nvidia.com/default/topic/378993/function-cublassetvector/?offset=2)
### Matrix in cuBLAS
#### cublasSetMatrix()
```C=
cublasStatus_t
cublasSetMatrix(int rows, int cols, int elemSize,
const void *A, int lda, void *B, int ldb)
```
* Rmk :
* A : host
* B : GPU
#### cublasGetMatrix()
```C=
cublasStatus_t
cublasGetMatrix(int rows, int cols, int elemSize,
const void *A, int lda, void *B, int ldb)
```
* Rmk :
* A : GPU
* B : host
* Notation:
* The leading dimension indicates the number of **rows** of the allocated matrix
The leading dimension always refers to the length of the first dimension of the array
---
## BLAS1 vector-vector
> We will use abbreviations Using the cuBLAS API <type> for type and <t> for the corresponding short type
>Most of the matries are 1-based
### Table
| type | t | Meaning |
| -----|-----|---------|
|float |‘s’ or ‘S’| real single-precision|
|double| ‘d’ or ‘D’| real double-precision|
|cuComplex| ‘c’ or ‘C’| complex single-precision|
|cuDoubleComplex| ‘z’ or ‘Z’| complex double-precision|
### cublasI\<t>amax()
```C=
cublasStatus_t cublasIsamax(cublasHandle_t handle, int n, const float *x, int incx, int *result)
cublasStatus_t cublasIdamax(cublasHandle_t handle, int n, const double *x, int incx, int *result)
cublasStatus_t cublasIcamin(cublasHandle_t handle, int n, const cuComplex *x, int incx, int *result)
cublasStatus_t cublasIzamin(cublasHandle_t handle, int n, const cuDoubleComplex *x, int incx, int *result)
```
* Return : index of max/min in the given array
* incx : as above
### cublas\<t>axpy()
``` C=
cublasStatus_t cublasSaxpy(cublasHandle_t handle, int n, const float *alpha, const float *x, int incx, float *y, int incy)
```
* Return : $y$ where this function calculates $\alpha x+y$
* Rmk : x,y can be part of matrix, when dealing with this case, we would change our incx\incy parameters as above.
### cublas\<t>copy()
```C=
cublasStatus_t cublasScopy(cublasHandle_t handle, int n, const float *x, int incx, float *y, int incy)
```
* Return : y where this function copies the vector x into the vector y
* Rmk : x,y can be part of matrix, when dealing with this case, we would change our incx\incy parameters as above.
### cublas\<t>dot()
```C=
cublasStatus_t cublasSdot (cublasHandle_t handle, int n, const float *x, int incx, const float *y, int incy, float *result)
```
* Rmk : Here we can call the function, ```cublasSetPointerMode``` to modify the result into specific pointer.
### cublas\<t>nrm2()
```C=
cublasStatus_t cublasSnrm2 (cublasHandle_t handle, int n, const float *x, int incx, float *result)
```
### cublas\<t>rot()
```C=
cublasStatus_t cublasSrot(cublasHandle_t handle, int n, float *x, int incx, float *y, int incy, const float *c, const float *s)
```
* Rmk : where s is the entry of the rotation matrix $G = \begin{pmatrix} c & s \\ -s & c \end{pmatrix}$ , s is the sin value, c is the cos value.
### cublas\<t>scal()
```C=
cublasStatus_t cublasSrot(cublasHandle_t handle, int n, const float *alpha, float *x, int incx)
```
* Rmk : return $x = \alpha \cdot x$
## BLAS2 matrix-vector
### cublas\<t>gemv()
```C=
cublasStatus_t cublasSgemv(cublasHandle_t handle, cublasOperation_t trans, int m, int n, const float *alpha, const float *A, int lda, const float *x, int incx, const float *beta, float *y, int incy)
```
* Return : $\alpha \cdot op(A) \cdot x+ \beta \cdot y$
* Rmk : cublasOperation_t = CUBLAS_OP_T/CUBLAS_OP_N/CUBLAS_OP_H
### cublas\<t>ger()
```C=
cublasStatus_t cublasSger(cublasHandle_t handle, int m, int n, const float *alpha, const float *x, int incx, const float *y, int incy, float *A, int lda)
```
* Return : $\alpha xy^{T or H} + A$ (T or H is related to which function you call)
-------------------------------
### Particular functions
> Following functions are some particular format wich s represent symmetric and t represent triangular.
-------------------------------
#### cublas\<t>spmv()
```C=
cublasStatus_t cublasSsbmv(cublasHandle_t handle, cublasFillMode_t uplo, int n, int k, const float *alpha, const float *A, int lda, const float *x, int incx, const float *beta, float *y, int incy)
```
* Return : $\alpha Ax+\beta y$
* Rmk :
* A is a symmetric matrix stored in packed format
* packed format

* some rmk in toolkit

#### cublas\<t>tbmv()
```C=
cublasStatus_t cublasStbmv(cublasHandle_t handle, cublasFillMode_t uplo, cublasOperation_t trans, cublasDiagType_t diag, int n, int k, const float *A, int lda, float *x, int incx)
```
* Return : $x = Ax$
* Rmk : triangular banded matrix-vector multiplication
* Rmk in toolkit

#### cublas\<t>tbsv()
```C=
cublasStatus_t cublasStbsv(cublasHandle_t handle, cublasFillMode_t uplo, cublasOperation_t trans, cublasDiagType_t diag, int n, int k, const float *A, int lda, float *x, int incx)
```
* Solving $op(A)x=b$
* Notation :
* cublasDiagType_t ==CUBLAS_DIAG_NON_UNIT or CUBLAS_DIAG_UNIT
* A is triangular
---
## BLAS3 matrix-matrix
### cublas\<t>gemm()
```C=
cublasStatus_t cublasSgemm(cublasHandle_t handle,
cublasOperation_t transa,
cublasOperation_t transb,
int m, int n, int k,
const float *alpha,
const float *A, int lda,
const float *B, int ldb,
const float *beta,
float *C, int ldc)
```
* Return : $\alpha\cdot op(A)op(B) + \beta C$
### cublas\<t>geam()
```C=
cublasStatus_t cublasSgeam(cublasHandle_t handle,
cublasOperation_t transa,
cublasOperation_t transb,
int m, int n,
const float *alpha,
const float *A, int lda, const float *bet,
const float *B, int ldb, float *C, int ldc)
```
* Return : $C=\alpha A+\beta B$
* Rmk : [在地演算法](https://zh.wikipedia.org/wiki/%E5%8E%9F%E5%9C%B0%E7%AE%97%E6%B3%95)只在A或B=C時能使用
### cublas\<t>dgmm()
```C=
cublasStatust cublasSdgmm(cublasHandle_t handle,
cublasSideMode_t mode,
int m, int n,
const float *A, int lda,
const float *x, int incx,
float *C, int ldc)
```
* Return : $A*diag(X)$ respect to mode
---
## Solveing Ax=b by LU decomposition
* calling cublas\<t>getrfBatched() → getting the LU decomposition from which it return.
* cublas\<t>getriBatched() → getting the inverse of L and U.
### cublas\<t>getrfBatched()
```C=
cublasStatus_t cublasSgetrfBatched(cublasHandle_t handle,
int n,
float *Aarray[],
int lda,
int *PivotArray,
int *infoArray,
int batchSize);
```
* Note :
* ```*Aarray[]``` : an array that stores the matrixes that going to be decomposed.
* ```lda``` : leading dimension of matrixes that stored in Aarray[]
* ```PivotArray``` : this array[i] stored the row that would be exchanged form identity matrix.

* ```infoArray``` : if return != 0 then the process is failed.
### cublas\<t>getriBatched()
```C=
cublasStatus_t cublasSgetriBatched(cublasHandle_t handle,
int n,
float *Aarray[],
int lda,
int *PivotArray,
float *Carray[],
int ldc,
int *infoArray,
int batchSize);
```
* Notation :
* ```infoArray``` : if return != 0 then the process is failed.
* return C.
[reference](https://stackoverflow.com/questions/28794010/solving-linear-systems-ax-b-with-cuda)