futhark-0.25.15: An optimising compiler for a functional, array-oriented language.
Safe Haskell: Safe-Inferred
Language: GHC2021

Futhark.CodeGen.ImpGen.GPU.SegRed

Description

We generate code for non-segmented/single-segment SegRed using the basic approach outlined in the paper "Design and GPGPU Performance of Futhark’s Redomap Construct" (ARRAY '16). The main deviations are:

  • While we still use a two-phase reduction, we use only a single kernel: the last threadblock to write its result (tracked via an atomic counter) also performs the final reduction.
  • Instead of depending on storage layout transformations to handle non-commutative reductions efficiently, we slide a tblock_size-sized window over the input and perform a parallel reduction for each window. This sacrifices efficient sequentialisation, but is sometimes faster, definitely simpler and more predictable, and uses less auxiliary storage.
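The sliding-window scheme can be illustrated with a small sequential sketch (plain Python; function name and shape are hypothetical, and the tree reduction here stands in for the parallel intra-block reduction the real kernel performs):

```python
def sliding_window_reduce(op, ne, xs, tblock_size):
    # Slide a tblock_size-sized window over the input. Each window is
    # reduced with an order-preserving tree reduction (adjacent pairs),
    # so `op` only needs to be associative, not commutative.
    acc = ne
    for start in range(0, len(xs), tblock_size):
        window = xs[start:start + tblock_size]
        while len(window) > 1:
            # Combine adjacent elements; the odd one out is carried over.
            window = [op(window[i], window[i + 1])
                      if i + 1 < len(window) else window[i]
                      for i in range(0, len(window), 2)]
        acc = op(acc, window[0])
    return acc
```

Because windows are consumed left to right and each tree reduction preserves element order, a non-commutative operator such as concatenation still produces the in-order result.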

For segmented reductions we use the approach from "Strategies for Regular Segmented Reductions on GPU" (FHPC '17). This involves two different strategies, dynamically deciding which one to use based on the number of segments and the segment size. We use the (static) tblock_size to decide which of the following two strategies to choose:

  • Large: uses one or more blocks to process a single segment. If multiple blocks are used per segment, the intermediate reduction results must be recursively reduced until only a single value per segment remains.

    Each thread can read multiple elements, which greatly increases performance; however, if the reduction is non-commutative, we must use a less efficient traversal (with interim block-wide reductions) to enable coalesced memory accesses, just as in the non-segmented case.

  • Small: lets each block process *multiple* segments. We only use this strategy when at least two segments fit within a single block; with the large strategy, those cases would allocate a whole block per segment, but at most 50% of the threads in the block would have any element to read, which is highly inefficient.
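The strategy choice described above can be sketched as follows (a simplified model, not the compiler's actual code; names and the exact threshold shape are hypothetical):

```python
def choose_strategy(segment_size, tblock_size):
    # Small strategy: only when at least two segments fit in one block.
    segments_per_block = tblock_size // segment_size if segment_size > 0 else 0
    if segments_per_block >= 2:
        return ("small", segments_per_block)
    # Large strategy: one or more blocks cooperate on each segment;
    # ceiling division gives the number of blocks per segment.
    blocks_per_segment = -(-segment_size // tblock_size)
    return ("large", blocks_per_segment)
```

With tblock_size = 256, a segment size of 10 packs 25 segments per block (small strategy), while a segment size of 1000 spreads each segment over 4 blocks (large strategy), whose partial results must then be reduced further.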

An optimization specifically targeted at non-segmented and large-segments segmented reductions with non-commutative operators: the stage-one main loop is essentially stripmined by a factor *chunk*, inserting collective copies via shared memory of each reduction parameter going into the intra-block (partial) reductions. This saves a factor *chunk* number of intra-block reductions at the cost of some overhead in the collective copies.
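The effect of this stripmining can be modelled sequentially (a sketch under assumptions: names are hypothetical, each list element of `privates` stands for one thread's private accumulator, and the collective shared-memory copies are elided):

```python
def stage_one_chunked(op, ne, xs, tblock_size, chunk):
    # Stripmined stage-one loop: each "thread" first folds `chunk`
    # consecutive elements privately (fed via collective copies through
    # shared memory in the real kernel), so only one intra-block
    # reduction is needed per span of `chunk` windows.
    acc = ne
    block_reductions = 0
    span = tblock_size * chunk
    for start in range(0, len(xs), span):
        piece = xs[start:start + span]
        privates = []
        for t in range(0, len(piece), chunk):
            p = piece[t]
            for x in piece[t + 1:t + chunk]:
                p = op(p, x)      # sequential private fold, in order
            privates.append(p)
        for p in privates:        # one intra-block reduction per span
            acc = op(acc, p)
        block_reductions += 1
    return acc, block_reductions
```

With chunk = 2 the loop performs half as many intra-block reductions as with chunk = 1, while producing the same (order-preserving) result, mirroring the factor-*chunk* saving described above.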


Documentation

compileSegRed :: Pat LetDecMem -> SegLevel -> SegSpace -> [SegBinOp GPUMem] -> KernelBody GPUMem -> CallKernelGen () Source #

Compile a SegRed instance to host-level code, with calls to various kernels.

compileSegRed' :: Pat LetDecMem -> KernelGrid -> SegSpace -> [SegBinOp GPUMem] -> DoSegBody -> CallKernelGen () Source #

Like compileSegRed, but where the body is a monadic action.

type DoSegBody = ([(SubExp, [TExp Int64])] -> InKernelGen ()) -> InKernelGen () Source #

Code generation for the body of the SegRed, taking a continuation for saving the results of the body. The results should be represented as a pairing of a SubExp with a list of indices into that SubExp for reading the result.
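The continuation-passing shape of DoSegBody can be illustrated with a small stand-in sketch (Python rather than the monadic Haskell type; the body and its results here are hypothetical):

```python
def do_seg_body(save_results):
    # Analogue of DoSegBody: compute the body's results, then hand them
    # to the caller-supplied continuation. Each result pairs a value
    # with a list of indices for reading it (empty here, standing in
    # for scalar results).
    results = [(42, []), (7, [])]
    save_results(results)

# The caller supplies the continuation that decides what to do with
# the results -- e.g. feeding them into the reduction operator.
collected = []
do_seg_body(collected.extend)
```

This inversion lets compileSegRed' interleave the body's code with the surrounding reduction machinery without the body needing to know where its results go.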