Copyright | [2009..2017] Trevor L. McDonell |
License | BSD |
Safe Haskell | None |
Language | Haskell98 |
Occupancy calculations for CUDA kernels
http://developer.download.nvidia.com/compute/cuda/3_0/sdk/docs/CUDA_Occupancy_calculator.xls
Determining Registers Per Thread and Shared Memory Per Block
To determine the number of registers used per thread in your kernel, simply compile the kernel code using the option --ptxas-options=-v to nvcc. This will output information about register, local memory, shared memory, and constant memory usage for each kernel in the .cu file.

Alternatively, you can compile with the -cubin option to nvcc. This will generate a .cubin file, which you can open in a text editor. Look for the code section with your kernel's name. Within the curly braces ({ ... }) for that code block, you will see a line with reg = X, where X is the number of registers used by your kernel. You can also see the amount of shared memory used as smem = Y. However, if your kernel declares any external shared memory that is allocated dynamically, you will need to add the number in the .cubin file to the amount you dynamically allocate at run time to get the correct shared memory usage.
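To make the last point concrete, here is a minimal sketch with made-up figures: the shared memory per block used by the occupancy calculations in this module should be the sum of the statically reported amount and any dynamic allocation.

    -- Hypothetical figures: the "smem = Y" value reported in the .cubin file,
    -- plus the shared memory allocated dynamically at kernel launch.  Their
    -- sum is the per-block shared memory figure expected by this module.
    totalSharedMem :: Int
    totalSharedMem = staticSmem + dynamicSmem
      where
        staticSmem  = 3200   -- reported at compile time (hypothetical)
        dynamicSmem = 1024   -- allocated at launch (hypothetical)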
Notes About Occupancy
Higher occupancy does not necessarily mean higher performance. If a kernel is not bandwidth-bound, then increasing occupancy will not necessarily increase performance. If a kernel invocation is already running at least one thread block per multiprocessor in the GPU, and it is bottlenecked by computation and not by global memory accesses, then increasing occupancy may have no effect. In fact, making changes just to increase occupancy can have other effects, such as additional instructions, spills to local memory (which is off-chip), divergent branches, etc. As with any optimization, you should experiment to see how changes affect the *wall clock time* of the kernel execution. For bandwidth-bound applications, on the other hand, increasing occupancy can help better hide the latency of memory accesses, and therefore improve performance.
- data Occupancy = Occupancy {
      activeThreads      :: !Int,
      activeThreadBlocks :: !Int,
      activeWarps        :: !Int,
      occupancy100       :: !Double
    }
- occupancy :: DeviceProperties -> Int -> Int -> Int -> Occupancy
- optimalBlockSize :: DeviceProperties -> (Int -> Int) -> (Int -> Int) -> (Int, Occupancy)
- optimalBlockSizeOf :: DeviceProperties -> [Int] -> (Int -> Int) -> (Int -> Int) -> (Int, Occupancy)
- maxResidentBlocks :: DeviceProperties -> Int -> Int -> Int -> Int
- incPow2 :: DeviceProperties -> [Int]
- incWarp :: DeviceProperties -> [Int]
- decPow2 :: DeviceProperties -> [Int]
- decWarp :: DeviceProperties -> [Int]
Documentation
data Occupancy Source #

Constructors

Occupancy
  activeThreads :: !Int
  activeThreadBlocks :: !Int
  activeWarps :: !Int
  occupancy100 :: !Double

occupancy Source #
:: DeviceProperties | Properties of the card in question |
-> Int | Threads per block |
-> Int | Registers per thread |
-> Int | Shared memory per block (bytes) |
-> Occupancy |
Calculate occupancy data for a given GPU and kernel resource usage
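As a usage sketch (the import path and all resource figures below are assumptions, standing in for the values reported by ptxas as described above):

    import Foreign.CUDA.Analysis   -- assumed to bring DeviceProperties and this module into scope

    -- Occupancy for a kernel launched with 128 threads per block, using
    -- 32 registers per thread and 4096 bytes of shared memory per block
    -- (all figures hypothetical).
    kernelOccupancy :: DeviceProperties -> Occupancy
    kernelOccupancy dev = occupancy dev 128 32 4096

The occupancy100 field of the result gives the occupancy figure itself; the other fields report the corresponding active thread, warp, and block counts.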
optimalBlockSize Source #

:: DeviceProperties | Architecture to optimise for |
-> (Int -> Int) | Register count as a function of thread block size |
-> (Int -> Int) | Shared memory usage (bytes) as a function of thread block size |
-> (Int, Occupancy) |
Optimise multiprocessor occupancy as a function of thread block size and resource usage. This returns the smallest satisfying block size in increments of a single warp.
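A minimal sketch (assumed import path, hypothetical register count): for a kernel whose resource usage does not depend on the block size, constant functions suffice.

    import Foreign.CUDA.Analysis   -- assumed import path

    -- Smallest block size (in whole-warp increments) giving maximum
    -- occupancy, for a kernel using 32 registers per thread and no
    -- shared memory at any block size (hypothetical figures).
    bestBlock :: DeviceProperties -> (Int, Occupancy)
    bestBlock dev = optimalBlockSize dev (const 32) (const 0)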
optimalBlockSizeOf Source #

:: DeviceProperties | Architecture to optimise for |
-> [Int] | Thread block sizes to consider |
-> (Int -> Int) | Register count as a function of thread block size |
-> (Int -> Int) | Shared memory usage (bytes) as a function of thread block size |
-> (Int, Occupancy) |
As optimalBlockSize, but with a generator that produces the specific thread block sizes that should be tested. The generated list can produce values in any order, but the last satisfying block size will be returned. Hence, values should be monotonically decreasing to return the smallest block size yielding maximum occupancy, and vice-versa.
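For instance (assumed import path, hypothetical resource functions), pairing it with the decWarp generator below scans candidate sizes downwards in whole-warp steps, so the smallest block size yielding maximum occupancy is returned:

    import Foreign.CUDA.Analysis   -- assumed import path

    -- Hypothetical kernel using 40 registers per thread and 16 bytes of
    -- shared memory per thread; candidate block sizes decrease by one warp
    -- at a time, so the smallest maximal-occupancy size wins.
    bestBlockOf :: DeviceProperties -> (Int, Occupancy)
    bestBlockOf dev = optimalBlockSizeOf dev (decWarp dev) (const 40) (\nt -> 16 * nt)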
maxResidentBlocks Source #

:: DeviceProperties | Properties of the card in question |
-> Int | Threads per block |
-> Int | Registers per thread |
-> Int | Shared memory per block (bytes) |
-> Int | Maximum number of resident blocks |
Determine the maximum number of CTAs that can be run simultaneously for a given kernel / device combination.
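A short sketch mirroring the occupancy example above (assumed import path, hypothetical resource figures):

    import Foreign.CUDA.Analysis   -- assumed import path

    -- Maximum number of thread blocks simultaneously resident on one
    -- multiprocessor, for a kernel using 128 threads per block,
    -- 32 registers per thread, and 4096 bytes of shared memory per block.
    residentBlocks :: DeviceProperties -> Int
    residentBlocks dev = maxResidentBlocks dev 128 32 4096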
incPow2 :: DeviceProperties -> [Int] Source #
Increments in powers-of-two, over the range of supported thread block sizes for the given device.
incWarp :: DeviceProperties -> [Int] Source #
Increments in the warp size of the device, over the range of supported thread block sizes.
decPow2 :: DeviceProperties -> [Int] Source #
Decrements in powers-of-two, over the range of supported thread block sizes for the given device.
decWarp :: DeviceProperties -> [Int] Source #
Decrements in the warp size of the device, over the range of supported thread block sizes.
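These generators slot directly into optimalBlockSizeOf. A final sketch (assumed import path, hypothetical register count): because decPow2 is decreasing, the call below returns the smallest power-of-two block size achieving maximum occupancy; per the note on optimalBlockSizeOf, an increasing generator such as incPow2 would return the largest instead.

    import Foreign.CUDA.Analysis   -- assumed import path

    -- Restrict the search to power-of-two block sizes, for a hypothetical
    -- kernel using 24 registers per thread and no shared memory.
    bestPow2Block :: DeviceProperties -> (Int, Occupancy)
    bestPow2Block dev = optimalBlockSizeOf dev (decPow2 dev) (const 24) (const 0)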