Safe Haskell | Safe-Inferred |
---|---|
Language | Haskell2010 |
Carefully optimised implementations of GPU transpositions. Written in ImpCode so that they can be compiled to both CUDA and OpenCL.
Documentation
data TransposeType Source #
Which form of transposition to generate code for.
TransposeNormal | The general-case transposition. |
TransposeLowWidth | For input arrays with low width. |
TransposeLowHeight | For input arrays with low height. |
TransposeSmall | For small arrays that do not benefit from coalescing. |
Instances
Eq TransposeType Source # | Defined in Futhark.CodeGen.ImpGen.Kernels.Transpose |
Ord TransposeType Source # | Defined in Futhark.CodeGen.ImpGen.Kernels.Transpose |
Show TransposeType Source # | Defined in Futhark.CodeGen.ImpGen.Kernels.Transpose |
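Taken together, the documentation above pins down the declaration below (the deriving clause matches the listed instances). The chooseTransposeType helper that follows is a hypothetical sketch of how a caller might pick a variant from the array dimensions; it is not part of this module, and its thresholds are assumptions for illustration only.

```haskell
-- The data type as documented above; 'deriving' matches the listed instances.
data TransposeType
  = TransposeNormal
  | TransposeLowWidth
  | TransposeLowHeight
  | TransposeSmall -- ^ For small arrays that do not benefit from coalescing.
  deriving (Eq, Ord, Show)

-- Hypothetical helper, not part of this module: choose a variant from the
-- width and height of the 2D array being transposed, given the tile width
-- 'blockDim'. The thresholds are illustrative guesses.
chooseTransposeType :: Int -> Int -> Int -> TransposeType
chooseTransposeType blockDim width height
  | width * height < blockDim * blockDim = TransposeSmall
  | width  < blockDim                    = TransposeLowWidth
  | height < blockDim                    = TransposeLowHeight
  | otherwise                            = TransposeNormal
```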
type TransposeArgs = (VName, TExp Int32, VName, TExp Int32, TExp Int32, TExp Int32, TExp Int32, TExp Int32, TExp Int32, VName) Source #
The types of the arguments accepted by a transposition function.
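The alias only fixes the shape of the tuple, not what each field means. A hedged reading, based on how the kernel is described below, is (destination buffer, destination offset, source buffer, source offset, number of arrays, x elements, y elements, mulx, muly, shared-memory block); those names are assumptions for illustration, not part of the API.

```haskell
-- Hypothetical field names; only the tuple shape is guaranteed by the type.
-- (destmem, destoffset, srcmem, srcoffset,
--  num_arrays, x_elems, y_elems, mulx, muly, block)
transposeBuffers :: TransposeArgs -> (VName, VName)
transposeBuffers (destmem, _destoffset, srcmem, _srcoffset,
                  _num_arrays, _x_elems, _y_elems, _mulx, _muly, _block) =
  (destmem, srcmem)
```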
mapTransposeKernel :: String -> Integer -> TransposeArgs -> PrimType -> TransposeType -> Kernel Source #
Generate a transpose kernel. There is special support to handle input arrays with low width, low height, or both.
Normally, when transposing a [2][n] array, we would use a FUT_BLOCK_DIM x FUT_BLOCK_DIM group to process a [2][FUT_BLOCK_DIM] slice of the input array. This would mean that many of the threads in a group would be inactive. We try to remedy this by using a special kernel that processes a larger part of the input through more complex indexing. In our example, we could use all threads in a group if we process (2/FUT_BLOCK_DIM) as large a slice of each row per group. The variable mulx contains this factor for the kernel that handles input arrays with low height.
See issue #308 on GitHub for more details.
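To make the example above concrete (a hedged sketch, not the module's code): with FUT_BLOCK_DIM = 16 and a [2][n] input, the normal kernel keeps only 2*16 of a group's 16*16 threads busy, i.e. the fraction 2/FUT_BLOCK_DIM; the low-height kernel compensates by letting each group cover a correspondingly wider slice of each row. The wideningFactor below is a guess at what mulx amounts to.

```haskell
-- Fraction of threads that would have work under the normal kernel when a
-- blockDim x blockDim group processes a [height][blockDim] slice.
activeFraction :: Int -> Int -> Double
activeFraction blockDim height =
  fromIntegral (height * blockDim) / fromIntegral (blockDim * blockDim)

-- Assumed widening factor (a guess at what 'mulx' amounts to): how many
-- times wider a slice of each row a group must cover so every thread is busy.
wideningFactor :: Int -> Int -> Int
wideningFactor blockDim height = blockDim `quot` height

-- For the [2][n] example with blockDim = 16:
--   activeFraction 16 2 == 0.125   -- i.e. 2/FUT_BLOCK_DIM
--   wideningFactor 16 2 == 8
```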
These kernels are optimized to ensure all global reads and writes are coalesced, and to avoid bank conflicts in shared memory. Each thread group transposes a 2D tile of block_dim*2 by block_dim*2 elements. The size of a thread group is block_dim/2 by block_dim*2, meaning that each thread will process 4 elements in a 2D tile. The shared memory array containing the 2D tile consists of block_dim*2 by block_dim*2+1 elements. Padding each row with an additional element prevents bank conflicts from occurring when the tile is accessed column-wise.
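The sizes above can be checked with a small sketch (generic names, assumed for illustration; not the module's code): a tile of (block_dim*2)^2 elements, a group of (block_dim/2)*(block_dim*2) threads, four elements per thread, and shared-memory rows padded to block_dim*2+1 elements.

```haskell
-- Hedged sketch of the tile and group arithmetic described above.
tileGeometry :: Int -> (Int, Int, Int, Int)
tileGeometry blockDim =
  let tileDim        = 2 * blockDim                   -- tile is tileDim x tileDim
      groupThreads   = (blockDim `quot` 2) * tileDim  -- group is (block_dim/2) x (block_dim*2)
      elemsPerThread = (tileDim * tileDim) `quot` groupThreads
      paddedRowLen   = tileDim + 1                    -- the extra element avoids bank conflicts
  in (tileDim, groupThreads, elemsPerThread, paddedRowLen)

-- With block_dim = 16: a 32x32 tile, 8*32 = 256 threads per group, 4 elements
-- per thread, and 33 elements per padded shared-memory row.
--   tileGeometry 16 == (32, 256, 4, 33)
```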