Safe Haskell | Safe-Inferred |
---|---|
Language | Haskell2010 |
Carefully optimised implementations of GPU transpositions. Written in ImpCode so that they can be compiled to both CUDA and OpenCL.
Documentation
data TransposeType Source #
Which form of transposition to generate code for.
TransposeNormal | The general-case transposition. |
TransposeLowWidth | For input arrays with low width. |
TransposeLowHeight | For input arrays with low height. |
TransposeSmall | For small arrays that do not benefit from coalescing. |
Instances
Eq TransposeType Source # | Defined in Futhark.CodeGen.ImpGen.Kernels.Transpose |
Ord TransposeType Source # | Defined in Futhark.CodeGen.ImpGen.Kernels.Transpose |
Show TransposeType Source # | Defined in Futhark.CodeGen.ImpGen.Kernels.Transpose |
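Taken together, the documentation above pins down the declaration below (the deriving clause matches the listed instances). The chooseTransposeType helper that follows is a hypothetical sketch of how a caller might pick a variant from the array dimensions; it is not part of this module, and its thresholds are assumptions for illustration only.

```haskell
-- The data type as documented above; 'deriving' matches the listed instances.
data TransposeType
  = TransposeNormal
  | TransposeLowWidth
  | TransposeLowHeight
  | TransposeSmall -- ^ For small arrays that do not benefit from coalescing.
  deriving (Eq, Ord, Show)

-- Hypothetical helper, not part of this module: choose a variant from the
-- width and height of the 2D array being transposed, given the tile width
-- 'blockDim'. The thresholds are illustrative guesses.
chooseTransposeType :: Int -> Int -> Int -> TransposeType
chooseTransposeType blockDim width height
  | width * height < blockDim * blockDim = TransposeSmall
  | width  < blockDim                    = TransposeLowWidth
  | height < blockDim                    = TransposeLowHeight
  | otherwise                            = TransposeNormal
```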
type TransposeArgs = (VName, TExp Int32, VName, TExp Int32, TExp Int32, TExp Int32, TExp Int32, TExp Int32, TExp Int32, VName) Source #
The types of the arguments accepted by a transposition function.
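The alias only fixes the shape of the tuple, not what each field means. A hedged reading, based on how the kernel is described below, is (destination buffer, destination offset, source buffer, source offset, number of arrays, x elements, y elements, mulx, muly, shared-memory block); those names are assumptions for illustration, not part of the API.

```haskell
-- Hypothetical field names; only the tuple shape is guaranteed by the type.
-- (destmem, destoffset, srcmem, srcoffset,
--  num_arrays, x_elems, y_elems, mulx, muly, block)
transposeBuffers :: TransposeArgs -> (VName, VName)
transposeBuffers (destmem, _destoffset, srcmem, _srcoffset,
                  _num_arrays, _x_elems, _y_elems, _mulx, _muly, _block) =
  (destmem, srcmem)
```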
mapTransposeKernel :: String -> Integer -> TransposeArgs -> PrimType -> TransposeType -> Kernel Source #
Generate a transpose kernel. There is special support to handle input arrays with low width, low height, or both.
Normally, when transposing a [2][n] array, we would use a FUT_BLOCK_DIM x FUT_BLOCK_DIM group to process a [2][FUT_BLOCK_DIM] slice of the input array. This would mean that many of the threads in a group would be inactive. We try to remedy this by using a special kernel that processes a larger part of the input through more complex indexing. In our example, we could use all threads in a group if we process (2/FUT_BLOCK_DIM) as large a slice of each row per group. The variable mulx contains this factor for the kernel that handles input arrays with low height.
See issue #308 on GitHub for more details.
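To make the example above concrete (a hedged sketch, not the module's code): with FUT_BLOCK_DIM = 16 and a [2][n] input, the normal kernel keeps only 2*16 of a group's 16*16 threads busy, i.e. the fraction 2/FUT_BLOCK_DIM; the low-height kernel compensates by letting each group cover a correspondingly wider slice of each row. The wideningFactor below is a guess at what mulx amounts to.

```haskell
-- Fraction of threads that would have work under the normal kernel when a
-- blockDim x blockDim group processes a [height][blockDim] slice.
activeFraction :: Int -> Int -> Double
activeFraction blockDim height =
  fromIntegral (height * blockDim) / fromIntegral (blockDim * blockDim)

-- Assumed widening factor (a guess at what 'mulx' amounts to): how many
-- times wider a slice of each row a group must cover so every thread is busy.
wideningFactor :: Int -> Int -> Int
wideningFactor blockDim height = blockDim `quot` height

-- For the [2][n] example with blockDim = 16:
--   activeFraction 16 2 == 0.125   -- i.e. 2/FUT_BLOCK_DIM
--   wideningFactor 16 2 == 8
```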
These kernels are optimized to ensure all global reads and writes are coalesced, and to avoid bank conflicts in shared memory. Each thread group transposes a 2D tile of block_dim*2 by block_dim*2 elements. The size of a thread group is block_dim/2 by block_dim*2, meaning that each thread will process 4 elements in a 2D tile. The shared memory array containing the 2D tile consists of block_dim*2 by block_dim*2+1 elements. Padding each row with an additional element prevents bank conflicts from occurring when the tile is accessed column-wise.
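The sizes above can be checked with a small sketch (generic names, assumed for illustration; not the module's code): a tile of (block_dim*2)^2 elements, a group of (block_dim/2)*(block_dim*2) threads, four elements per thread, and shared-memory rows padded to block_dim*2+1 elements.

```haskell
-- Hedged sketch of the tile and group arithmetic described above.
tileGeometry :: Int -> (Int, Int, Int, Int)
tileGeometry blockDim =
  let tileDim        = 2 * blockDim                   -- tile is tileDim x tileDim
      groupThreads   = (blockDim `quot` 2) * tileDim  -- group is (block_dim/2) x (block_dim*2)
      elemsPerThread = (tileDim * tileDim) `quot` groupThreads
      paddedRowLen   = tileDim + 1                    -- the extra element avoids bank conflicts
  in (tileDim, groupThreads, elemsPerThread, paddedRowLen)

-- With block_dim = 16: a 32x32 tile, 8*32 = 256 threads per group, 4 elements
-- per thread, and 33 elements per padded shared-memory row.
--   tileGeometry 16 == (32, 256, 4, 33)
```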