Copyright | [2009..2023] Trevor L. McDonell |
---|---|
License | BSD |
Safe Haskell | Safe-Inferred |
Language | Haskell98 |
This module defines an interface to the CUDA driver API. The Driver API
is a lower-level interface to CUDA devices than that provided by the
Runtime API. Using the Driver API, the programmer must deal explicitly
with operations such as initialisation, context management, and loading
(kernel) modules. Although more difficult to use initially, the Driver
API provides more control over how CUDA is used. Furthermore, since it
does not require compiling and linking the program with nvcc
, the
Driver API provides better inter-language compatibility.
The following is a short tutorial on using the Driver API. The steps can
be copied into a file, or run directly in ghci
, in which case ghci
should be launched with the option -fno-ghci-sandbox
. This is because
CUDA maintains CPU-local state, so operations should always be run from
a bound thread.
- Using the Driver API
Before any operation can be performed, the Driver API must be initialised:
>>>
import Foreign.CUDA.Driver
>>>
initialise []
Next, we must select a GPU that we will execute operations on. Each GPU
is assigned a unique identifier (beginning at zero). We can get a handle
to a compute device at a given ordinal using the device
operation.
Given a device handle, we can query the properties of that device using
props
. The number of available CUDA-capable devices is given via
count
. For example:
>>>
count
1>>>
dev0 <- device 0
>>>
props dev0
DeviceProperties {deviceName = "GeForce GT 650M", computeCapability = 3.0, ...}
This package also includes the executable 'nvidia-device-query', which when executed displays the key properties of all available devices. See Foreign.CUDA.Driver.Device for additional operations to query the capabilities or status of a device.
Once you have chosen a device to use, the next step is to create a CUDA
context. A context is associated with a particular device, and all
operations, such as memory allocation and kernel execution, take place
with respect to that context. For example, to create
a new execution
context on CUDA device 0:
>>>
ctx <- create dev0 []
The second argument is a set of ContextFlag
s which control how the
context behaves in various situations, for example, whether or not the
CPU should actively spin when waiting for results from the GPU
(SchedSpin
), or to yield control to other threads instead
(SchedYield
).
The newly created context is now the active context, and all subsequent operations take place within that context. More than one context can be created per device, but resources, such as memory allocated in the GPU, are unique to each context. The module Foreign.CUDA.Driver.Context contains operations for managing multiple contexts. Some devices allow data to be shared between contexts without copying, see Foreign.CUDA.Driver.Context.Peer for more information.
Once the context is no longer needed, it should be destroy
ed in order
to free up any resources that were allocated to it.
>>>
destroy ctx
Each device also has a unique context which is used by the Runtime API. This context can be accessed with the module Foreign.CUDA.Driver.Context.Primary.
- Executing kernels onto the GPU
Once the Driver API is initialised and an execution context is created on the GPU, we can begin to interact with it.
At an example, we'll step through executing the CUDA equivalent of the following Haskell function, which element-wise adds the elements of two arrays:
>>>
vecAdd xs ys = zipWith (+) xs ys
The following CUDA kernel can be used to implement this on the GPU:
extern "C" __global__ void vecAdd(float *xs, float *ys, float *zs, int N) { int ix = blockIdx.x * blockDim.x + threadIdx.x; if ( ix < N ) { zs[ix] = xs[ix] + ys[ix]; } }
Here, the global
keyword marks the function as a kernel that
should be computed on the GPU in data parallel. When we execute this
function on the GPU, (at least) N threads will execute N individual
instances of the kernel function vecAdd
. Each thread will operate on
a single element of each input array to create a single value in the
result. See the CUDA programming guide for more details.
We can save this to a file vector_add.cu
, and compile it using nvcc
into a form that we can then load onto the GPU and execute:
$ nvcc --ptx vector_add.cu
The module Foreign.CUDA.Driver.Module contains functions for loading the resulting .ptx file (or .cubin files) into the running program.
>>>
mdl <- loadFile "vector_add.ptx"
Once finished with the module, it is also a good idea to unload
it.
Modules may export kernel functions, global variables, and texture references. Before we can execute our function, we need to look it up in the module by name.
>>>
vecAdd <- getFun mdl "vecAdd"
Given this reference to our kernel function, we are almost ready to
execute it on the device using launchKernel
, but first, we must create
some data that we can execute the function on.
- Transferring data to and from the GPU
GPUs typically have their own memory which is separate from the CPU's memory, and we need to explicitly copy data back and forth between these two regions. The module Foreign.CUDA.Driver.Marshal provides functions for allocating memory on the GPU, and copying data between the CPU and GPU, as well as directly between multiple GPUs.
For simplicity, we'll use standard Haskell lists for our input and output data structure. Note however that this will have significantly lower effective bandwidth than reading a single contiguous region of memory, so for most practical purposes you will want to use some kind of unboxed array.
>>>
let xs = [1..1024] :: [Float]
>>>
let ys = [2,4..2048] :: [Float]
In CUDA, like C, all memory management is explicit, and arrays on the device must be explicitly allocated and freed. As mentioned previously, data transfer is also explicit. However, we do provide convenience functions for combined allocation and marshalling, as well as bracketed operations.
>>>
xs_dev <- newListArray xs
>>>
ys_dev <- newListArray ys
>>>
zs_dev <- mallocArray 1024 :: IO (DevicePtr Float)
After executing the kernel (see next section), we transfer the result back to the host, and free the memory that was allocated on the GPU.
>>>
zs <- peekListArray 1024 zs_dev
>>>
free xs_dev
>>>
free ys_dev
>>>
free zs_dev
- Piecing it all together
Finally, we have everything in place to execute our operation on the
GPU. Launching a kernel on the GPU consists of creating many threads on
the GPU which all execute the same function, and each thread has
a unique identifier in the grid/block hierarchy which can be used to
identify exactly which element this thread should process (the
blockIdx
and threadIdx
parameters that we saw earlier,
respectively).
To execute our function, we will use a grid of 4 blocks, each containing 256 threads. Thus, a total of 1024 threads will be launched, which will each compute a single element of the output array (recall that our input arrays each have 1024 elements). The module Foreign.CUDA.Analysis.Occupancy contains functions to help determine the ideal thread block size for a given kernel and GPU combination.
>>>
launchKernel vecAdd (4,1,1) (256,1,1) 0 Nothing [VArg xs_dev, VArg ys_dev, VArg zs_dev, IArg 1024]
Note that kernel execution is asynchronous, so we should also wait for the operation to complete before attempting to read the results back.
>>>
sync
And that's it!
- Next steps
As mentioned at the end of the previous section, kernels on the GPU are
executed asynchronously with respect to the host, and other operations
such as data transfers can also be executed asynchronously. This allows
the CPU to continue doing other work while the GPU is busy.
Event
s can be used to check whether an
operation has completed yet.
It is also possible to execute multiple kernels or data transfers
concurrently with each other, by assigning those operations to different
execution Stream
s. Used in conjunction with
Event
s, operations will be scheduled
efficiently only once all dependencies (in the form of
Event
s) have been cleared.
See Foreign.CUDA.Driver.Event and Foreign.CUDA.Driver.Stream for more information on this topic.
Synopsis
- module Foreign.CUDA.Ptr
- data Limit
- newtype Context = Context {
- useContext :: Ptr ()
- data ContextFlag
- data Cache
- data SharedMem
- type StreamPriority = Int
- data PeerFlag
- data PeerAttribute
- get :: IO (Maybe Context)
- create :: Device -> [ContextFlag] -> IO Context
- add :: Context -> [PeerFlag] -> IO ()
- destroy :: Context -> IO ()
- pop :: IO Context
- push :: Context -> IO ()
- sync :: IO ()
- set :: Context -> IO ()
- attach :: Context -> [ContextFlag] -> IO ()
- detach :: Context -> IO ()
- getFlags :: IO [ContextFlag]
- getLimit :: Limit -> IO Int
- setLimit :: Limit -> Int -> IO ()
- getCache :: IO Cache
- setCache :: Cache -> IO ()
- getSharedMem :: IO SharedMem
- setSharedMem :: SharedMem -> IO ()
- getStreamPriorityRange :: IO (StreamPriority, StreamPriority)
- accessible :: Device -> Device -> IO Bool
- remove :: Context -> IO ()
- getAttribute :: PeerAttribute -> Device -> Device -> IO Int
- module Foreign.CUDA.Driver.Device
- module Foreign.CUDA.Driver.Error
- module Foreign.CUDA.Driver.Exec
- module Foreign.CUDA.Driver.Marshal
- module Foreign.CUDA.Driver.Module
- module Foreign.CUDA.Driver.Unified
- module Foreign.CUDA.Driver.Utils
Documentation
module Foreign.CUDA.Ptr
Device limits flags
StackSize | |
PrintfFifoSize | |
MallocHeapSize | |
DevRuntimeSyncDepth | |
DevRuntimePendingLaunchCount | |
MaxL2FetchGranularity | |
PersistingL2CacheSize | |
Max |
data ContextFlag Source #
Context creation flags
SchedAuto | |
SchedSpin | |
SchedYield | |
SchedBlockingSync | |
BlockingSync | Deprecated: use SchedBlockingSync instead |
SchedMask | |
MapHost | |
LmemResizeToMax | |
FlagsMask |
Instances
Bounded ContextFlag Source # | |
Defined in Foreign.CUDA.Driver.Context.Base minBound :: ContextFlag # maxBound :: ContextFlag # | |
Enum ContextFlag Source # | |
Defined in Foreign.CUDA.Driver.Context.Base succ :: ContextFlag -> ContextFlag # pred :: ContextFlag -> ContextFlag # toEnum :: Int -> ContextFlag # fromEnum :: ContextFlag -> Int # enumFrom :: ContextFlag -> [ContextFlag] # enumFromThen :: ContextFlag -> ContextFlag -> [ContextFlag] # enumFromTo :: ContextFlag -> ContextFlag -> [ContextFlag] # enumFromThenTo :: ContextFlag -> ContextFlag -> ContextFlag -> [ContextFlag] # | |
Show ContextFlag Source # | |
Defined in Foreign.CUDA.Driver.Context.Base showsPrec :: Int -> ContextFlag -> ShowS # show :: ContextFlag -> String # showList :: [ContextFlag] -> ShowS # | |
Eq ContextFlag Source # | |
Defined in Foreign.CUDA.Driver.Context.Base (==) :: ContextFlag -> ContextFlag -> Bool # (/=) :: ContextFlag -> ContextFlag -> Bool # |
Device shared memory configuration preference
Instances
type StreamPriority = Int Source #
Priority of an execution stream. Work submitted to a higher priority stream may preempt execution of work already executing in a lower priority stream. Lower numbers represent higher priorities.
Possible option values for direct peer memory access
Instances
Enum PeerFlag Source # | |
Defined in Foreign.CUDA.Driver.Context.Peer |
data PeerAttribute Source #
Peer-to-peer attributes
PerformanceRank | |
AccessSupported | |
NativeAtomicSupported | |
AccessAccessSupported | |
CudaArrayAccessSupported |
Instances
Enum PeerAttribute Source # | |
Defined in Foreign.CUDA.Driver.Context.Peer succ :: PeerAttribute -> PeerAttribute # pred :: PeerAttribute -> PeerAttribute # toEnum :: Int -> PeerAttribute # fromEnum :: PeerAttribute -> Int # enumFrom :: PeerAttribute -> [PeerAttribute] # enumFromThen :: PeerAttribute -> PeerAttribute -> [PeerAttribute] # enumFromTo :: PeerAttribute -> PeerAttribute -> [PeerAttribute] # enumFromThenTo :: PeerAttribute -> PeerAttribute -> PeerAttribute -> [PeerAttribute] # | |
Show PeerAttribute Source # | |
Defined in Foreign.CUDA.Driver.Context.Peer showsPrec :: Int -> PeerAttribute -> ShowS # show :: PeerAttribute -> String # showList :: [PeerAttribute] -> ShowS # | |
Eq PeerAttribute Source # | |
Defined in Foreign.CUDA.Driver.Context.Peer (==) :: PeerAttribute -> PeerAttribute -> Bool # (/=) :: PeerAttribute -> PeerAttribute -> Bool # |
get :: IO (Maybe Context) Source #
Return the context bound to the calling CPU thread.
Requires CUDA-4.0.
create :: Device -> [ContextFlag] -> IO Context Source #
Create a new CUDA context and associate it with the calling thread. The
context is created with a usage count of one, and the caller of create
must call destroy
when done using the context. If a context is already
current to the thread, it is supplanted by the newly created context and
must be restored by a subsequent call to pop
.
add :: Context -> [PeerFlag] -> IO () Source #
If the devices of both the current and supplied contexts support unified addressing, then enable allocations in the supplied context to be accessible by the current context.
Note that access is unidirectional, and in order to access memory in the
current context from the peer context, a separate symmetric call to
add
is required.
Requires CUDA-4.0.
destroy :: Context -> IO () Source #
Destroy the specified context, regardless of how many threads it is
current to. The context will be pop
ed from the current thread's
context stack, but if it is current on any other threads it will remain
current to those threads, and attempts to access it will result in an
error.
Pop the current CUDA context from the CPU thread. The context may then
be attached to a different CPU thread by calling push
.
push :: Context -> IO () Source #
Push the given context onto the CPU's thread stack of current contexts. The specified context becomes the CPU thread's current context, so all operations that operate on the current context are affected.
Block until the device has completed all preceding requests. If the
context was created with the SchedBlockingSync
flag, the CPU thread
will block until the GPU has finished its work.
attach :: Context -> [ContextFlag] -> IO () Source #
Deprecated: as of CUDA-4.0
Increments the usage count of the context. API: no context flags are currently supported, so this parameter must be empty.
detach :: Context -> IO () Source #
Deprecated: as of CUDA-4.0
Detach the context, and destroy if no longer used
getFlags :: IO [ContextFlag] Source #
Return the flags that were used to create the current context.
Requires CUDA-7.0
setLimit :: Limit -> Int -> IO () Source #
Specify the size of the call stack, for compute 2.0 devices.
Requires CUDA-3.1.
On devices where the L1 cache and shared memory use the same hardware resources, this function returns the preferred cache configuration for the current context.
Requires CUDA-3.2.
setCache :: Cache -> IO () Source #
On devices where the L1 cache and shared memory use the same hardware resources, this sets the preferred cache configuration for the current context. This is only a preference.
Any function configuration set via
setCacheConfigFun
will be preferred over this
context-wide setting.
Requires CUDA-3.2.
getSharedMem :: IO SharedMem Source #
Return the current size of the shared memory banks in the current
context. On devices with configurable shared memory banks,
setSharedMem
can be used to change the configuration, so that
subsequent kernel launches will by default us the new bank size. On
devices without configurable shared memory, this function returns the
fixed bank size of the hardware.
Requires CUDA-4.2
setSharedMem :: SharedMem -> IO () Source #
On devices with configurable shared memory banks, this function will set the context's shared memory bank size that will be used by default for subsequent kernel launches.
Changing the shared memory configuration between launches may insert a device synchronisation.
Shared memory bank size does not affect shared memory usage or kernel occupancy, but may have major effects on performance. Larger bank sizes allow for greater potential bandwidth to shared memory, but change the kinds of accesses which result in bank conflicts.
Requires CUDA-4.2
getStreamPriorityRange :: IO (StreamPriority, StreamPriority) Source #
Returns the numerical values that correspond to the greatest and least priority execution streams in the current context respectively. Stream priorities follow the convention that lower numerical numbers correspond to higher priorities. The range of meaningful stream priorities is given by the inclusive range [greatestPriority,leastPriority].
Requires CUDA-5.5.
accessible :: Device -> Device -> IO Bool Source #
Queries if the first device can directly access the memory of the second. If
direct access is possible, it can then be enabled with add
.
Requires CUDA-4.0.
remove :: Context -> IO () Source #
Disable direct memory access from the current context to the supplied peer context, and unregisters any registered allocations.
Requires CUDA-4.0.
getAttribute :: PeerAttribute -> Device -> Device -> IO Int Source #
Queries attributes of the link between two devices
Requires CUDA-8.0
since 0.9.0.0
module Foreign.CUDA.Driver.Device
module Foreign.CUDA.Driver.Error
module Foreign.CUDA.Driver.Exec
module Foreign.CUDA.Driver.Marshal
module Foreign.CUDA.Driver.Module
module Foreign.CUDA.Driver.Unified
module Foreign.CUDA.Driver.Utils