Copyright | [2014..2017] Trevor L. McDonell |
---|---|
License | BSD3 |
Maintainer | Trevor L. McDonell <tmcdonell@cse.unsw.edu.au> |
Stability | experimental |
Portability | non-portable (GHC extensions) |
Safe Haskell | None |
Language | Haskell2010 |
The cuBLAS library is an implementation of BLAS (Basic Linear Algebra Subprograms) for NVIDIA GPUs.
To use operations from the cuBLAS library, the user must allocate the required matrices and vectors in the GPU memory space, fill them with data, call the desired sequence of cuBLAS functions, then copy the results from the GPU memory space back to the host.
The cuda package can be used for writing to and retrieving data from the GPU.
- Data layout
Unlike modern BLAS libraries, cuBLAS only provides Fortran-style implementations of the subprograms, using column-major storage and 1-based indexing.
The `geam` operation can be used to perform matrix transposition.
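As a plain-Haskell illustration of the layout difference (these helpers are not part of the cublas package), a matrix given as a list of rows can be flattened into either order:

```haskell
import Data.List (transpose)

-- Hypothetical helpers, for illustration only: flatten a matrix, given as
-- a list of rows, into row-major (C-style) order or into the column-major
-- (Fortran-style) order that cuBLAS expects.
rowMajor :: [[a]] -> [a]
rowMajor = concat

columnMajor :: [[a]] -> [a]
columnMajor = concat . transpose
```

For the 3x2 matrix `[[1,2],[3,4],[5,6]]`, `rowMajor` yields `[1,2,3,4,5,6]` whereas `columnMajor` yields `[1,3,5,2,4,6]`.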
- Example
As a short example, we show how to compute the following matrix-matrix product with `dgemm`:
\[ \left(\begin{matrix} 1 & 2 \\ 3 & 4 \\ 5 & 6 \\ \end{matrix}\right) \cdot \left(\begin{matrix} 1 & 2 & 3 \\ 4 & 5 & 6 \\ \end{matrix}\right) = \left(\begin{matrix} 9 & 12 & 15 \\ 19 & 26 & 33 \\ 29 & 40 & 51 \\ \end{matrix}\right) \]
We assume the CUDA environment has been initialised, as described in the Foreign.CUDA.Driver module:
>>> import Foreign.CUDA.Driver as CUDA
>>> import Foreign.CUDA.BLAS as BLAS
>>> CUDA.initialise []
>>> dev <- CUDA.device 0
>>> ctx <- CUDA.create dev []
Just as we must create a CUDA execution context with create before interacting with the GPU, we must create a BLAS context handle before executing any cuBLAS library operations, which will be associated with the current device context:
>>> hdl <- BLAS.create
Now, let us generate the matrix data on the GPU. (For simplicity in this example we will just marshal the data via lists, but in a real application with a large amount of data we should of course use some kind of unboxed array):
>>> let rowsA = 3; colsA = 2; sizeA = rowsA * colsA
>>> let rowsB = 2; colsB = 3; sizeB = rowsB * colsB
>>> let sizeC = rowsA * colsB
>>> matA <- CUDA.newListArray (take sizeA [1..] :: [Double])
>>> matB <- CUDA.newListArray (take sizeB [1..] :: [Double])
>>> matC <- CUDA.mallocArray sizeC :: IO (CUDA.DevicePtr Double)
Note in the above that we store data in row-major order, as is the convention in C. The cuBLAS library, however, assumes a column-major representation, in the style of Fortran. Luckily, we can make use of the following equivalence:
\[ B^T \cdot A^T = (A \cdot B)^T \]
and, since the transposed matrix in column-major representation is equivalent to our matrix in row-major representation, we can avoid any actual data manipulation to get things into a form suitable for cuBLAS (phew!).
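We can check this identity on the example matrices with a naive, list-based product (for illustration only; this is of course not how cuBLAS computes it):

```haskell
import Data.List (transpose)

-- Naive matrix product on matrices represented as lists of rows,
-- used only to check the transpose identity above.
mmul :: Num a => [[a]] -> [[a]] -> [[a]]
mmul a b = [ [ sum (zipWith (*) r c) | c <- transpose b ] | r <- a ]
```

For the A and B of the example, `transpose (mmul a b) == mmul (transpose b) (transpose a)`: handing cuBLAS our row-major A and B as if they were column-major Bᵀ and Aᵀ therefore produces (A·B)ᵀ in column-major order, which is exactly A·B in row-major order.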
The final things to take care of are the scaling parameters to the dgemm operation, \(\alpha\) and \(\beta\). By default these values are assumed to reside in host memory, but this setting can be changed with setPointerMode; when set to Device mode, the function withDevicePtr can be used to treat the device memory pointer as a plain pointer to pass to the function.
Now, we are ready to piece it all together:
>>> import Foreign.Marshal
>>> :{
with 1.0 $ \alpha ->
  with 0.0 $ \beta ->
    dgemm hdl N N colsB rowsA colsA alpha matB colsB matA colsA beta matC colsB
:}
And retrieve the result:
>>> print =<< CUDA.peekListArray sizeC matC
[9.0,12.0,15.0,19.0,26.0,33.0,29.0,40.0,51.0]
Finally, we should free the device memory we allocated, and release the BLAS context handle:

>>> mapM_ CUDA.free [matA, matB, matC]
>>> BLAS.destroy hdl
- Additional information
For more information, see the NVIDIA cuBLAS documentation: <http://docs.nvidia.com/cuda/cublas/index.html>
Control
module Foreign.CUDA.BLAS.Context
module Foreign.CUDA.BLAS.Stream
module Foreign.CUDA.BLAS.Error
Operations
module Foreign.CUDA.BLAS.Level1
module Foreign.CUDA.BLAS.Level2
module Foreign.CUDA.BLAS.Level3