Copyright | [2009..2017] Trevor L. McDonell |
License | BSD |
Safe Haskell | None |
Language | Haskell98 |
Occupancy calculations for CUDA kernels
http://developer.download.nvidia.com/compute/cuda/3_0/sdk/docs/CUDA_Occupancy_calculator.xls
Determining Registers Per Thread and Shared Memory Per Block
To determine the number of registers used per thread in your kernel, simply compile the kernel code using the option --ptxas-options=-v to nvcc. This will output information about register, local memory, shared memory, and constant memory usage for each kernel in the .cu file.

Alternatively, you can compile with the -cubin option to nvcc. This will generate a .cubin file, which you can open in a text editor. Look for the code section with your kernel's name. Within the curly braces ({ ... }) for that code block, you will see a line with reg = X, where X is the number of registers used by your kernel. You can also see the amount of shared memory used as smem = Y. However, if your kernel declares any external shared memory that is allocated dynamically, you will need to add the number in the .cubin file to the amount you dynamically allocate at run time to get the correct shared memory usage.
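To make the last point concrete, here is a minimal sketch with made-up figures: the shared memory per block used by the occupancy calculations in this module should be the sum of the statically reported amount and any dynamic allocation.

    -- Hypothetical figures: the "smem = Y" value reported in the .cubin file,
    -- plus the shared memory allocated dynamically at kernel launch.  Their
    -- sum is the per-block shared memory figure expected by this module.
    totalSharedMem :: Int
    totalSharedMem = staticSmem + dynamicSmem
      where
        staticSmem  = 3200   -- reported at compile time (hypothetical)
        dynamicSmem = 1024   -- allocated at launch (hypothetical)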
Notes About Occupancy
Higher occupancy does not necessarily mean higher performance. If a kernel is not bandwidth-bound, then increasing occupancy will not necessarily increase performance. If a kernel invocation is already running at least one thread block per multiprocessor in the GPU, and it is bottlenecked by computation and not by global memory accesses, then increasing occupancy may have no effect. In fact, making changes just to increase occupancy can have other effects, such as additional instructions, spills to local memory (which is off-chip), divergent branches, etc. As with any optimization, you should experiment to see how changes affect the *wall clock time* of the kernel execution. For bandwidth-bound applications, on the other hand, increasing occupancy can help better hide the latency of memory accesses, and therefore improve performance.
- data Occupancy = Occupancy {
      activeThreads      :: !Int,
      activeThreadBlocks :: !Int,
      activeWarps        :: !Int,
      occupancy100       :: !Double
    }
- occupancy :: DeviceProperties -> Int -> Int -> Int -> Occupancy
- optimalBlockSize :: DeviceProperties -> (Int -> Int) -> (Int -> Int) -> (Int, Occupancy)
- optimalBlockSizeOf :: DeviceProperties -> [Int] -> (Int -> Int) -> (Int -> Int) -> (Int, Occupancy)
- maxResidentBlocks :: DeviceProperties -> Int -> Int -> Int -> Int
- incPow2 :: DeviceProperties -> [Int]
- incWarp :: DeviceProperties -> [Int]
- decPow2 :: DeviceProperties -> [Int]
- decWarp :: DeviceProperties -> [Int]
Documentation
data Occupancy Source #

Constructors

Occupancy
  activeThreads :: !Int
  activeThreadBlocks :: !Int
  activeWarps :: !Int
  occupancy100 :: !Double

occupancy Source #
:: DeviceProperties | Properties of the card in question |
-> Int | Threads per block |
-> Int | Registers per thread |
-> Int | Shared memory per block (bytes) |
-> Occupancy |
Calculate occupancy data for a given GPU and kernel resource usage
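As a usage sketch (the import path and all resource figures below are assumptions, standing in for the values reported by ptxas as described above):

    import Foreign.CUDA.Analysis   -- assumed to bring DeviceProperties and this module into scope

    -- Occupancy for a kernel launched with 128 threads per block, using
    -- 32 registers per thread and 4096 bytes of shared memory per block
    -- (all figures hypothetical).
    kernelOccupancy :: DeviceProperties -> Occupancy
    kernelOccupancy dev = occupancy dev 128 32 4096

The occupancy100 field of the result gives the occupancy figure itself; the other fields report the corresponding active thread, warp, and block counts.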
optimalBlockSize Source #

:: DeviceProperties | Architecture to optimise for |
-> (Int -> Int) | Register count as a function of thread block size |
-> (Int -> Int) | Shared memory usage (bytes) as a function of thread block size |
-> (Int, Occupancy) |
Optimise multiprocessor occupancy as a function of thread block size and resource usage. This returns the smallest satisfying block size in increments of a single warp.
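A minimal sketch (assumed import path, hypothetical register count): for a kernel whose resource usage does not depend on the block size, constant functions suffice.

    import Foreign.CUDA.Analysis   -- assumed import path

    -- Smallest block size (in whole-warp increments) giving maximum
    -- occupancy, for a kernel using 32 registers per thread and no
    -- shared memory at any block size (hypothetical figures).
    bestBlock :: DeviceProperties -> (Int, Occupancy)
    bestBlock dev = optimalBlockSize dev (const 32) (const 0)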
optimalBlockSizeOf Source #

:: DeviceProperties | Architecture to optimise for |
-> [Int] | Thread block sizes to consider |
-> (Int -> Int) | Register count as a function of thread block size |
-> (Int -> Int) | Shared memory usage (bytes) as a function of thread block size |
-> (Int, Occupancy) |
As optimalBlockSize, but with a generator that produces the specific thread block sizes that should be tested. The generated list can produce values in any order, but the last satisfying block size will be returned. Hence, values should be monotonically decreasing to return the smallest block size yielding maximum occupancy, and vice-versa.
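For instance (assumed import path, hypothetical resource functions), pairing it with the decWarp generator below scans candidate sizes downwards in whole-warp steps, so the smallest block size yielding maximum occupancy is returned:

    import Foreign.CUDA.Analysis   -- assumed import path

    -- Hypothetical kernel using 40 registers per thread and 16 bytes of
    -- shared memory per thread; candidate block sizes decrease by one warp
    -- at a time, so the smallest maximal-occupancy size wins.
    bestBlockOf :: DeviceProperties -> (Int, Occupancy)
    bestBlockOf dev = optimalBlockSizeOf dev (decWarp dev) (const 40) (\nt -> 16 * nt)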
maxResidentBlocks Source #

:: DeviceProperties | Properties of the card in question |
-> Int | Threads per block |
-> Int | Registers per thread |
-> Int | Shared memory per block (bytes) |
-> Int | Maximum number of resident blocks |
Determine the maximum number of CTAs that can be run simultaneously for a given kernel / device combination.
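A short sketch mirroring the occupancy example above (assumed import path, hypothetical resource figures):

    import Foreign.CUDA.Analysis   -- assumed import path

    -- Maximum number of thread blocks simultaneously resident on one
    -- multiprocessor, for a kernel using 128 threads per block,
    -- 32 registers per thread, and 4096 bytes of shared memory per block.
    residentBlocks :: DeviceProperties -> Int
    residentBlocks dev = maxResidentBlocks dev 128 32 4096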
incPow2 :: DeviceProperties -> [Int] Source #
Increments in powers-of-two, over the range of supported thread block sizes for the given device.
incWarp :: DeviceProperties -> [Int] Source #
Increments in the warp size of the device, over the range of supported thread block sizes.
decPow2 :: DeviceProperties -> [Int] Source #
Decrements in powers-of-two, over the range of supported thread block sizes for the given device.
decWarp :: DeviceProperties -> [Int] Source #
Decrements in the warp size of the device, over the range of supported thread block sizes.
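These generators slot directly into optimalBlockSizeOf. A final sketch (assumed import path, hypothetical register count): because decPow2 is decreasing, the call below returns the smallest power-of-two block size achieving maximum occupancy; per the note on optimalBlockSizeOf, an increasing generator such as incPow2 would return the largest instead.

    import Foreign.CUDA.Analysis   -- assumed import path

    -- Restrict the search to power-of-two block sizes, for a hypothetical
    -- kernel using 24 registers per thread and no shared memory.
    bestPow2Block :: DeviceProperties -> (Int, Occupancy)
    bestPow2Block dev = optimalBlockSizeOf dev (decPow2 dev) (const 24) (const 0)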