Copyright | [2014..2017] Trevor L. McDonell [2014..2014] Vinod Grover (NVIDIA Corporation) |
---|---|
License | BSD3 |
Maintainer | Trevor L. McDonell <tmcdonell@cse.unsw.edu.au> |
Stability | experimental |
Portability | non-portable (GHC extensions) |
Safe Haskell | None |
Language | Haskell2010 |
This module implements a backend for the Accelerate language targeting multicore CPUs. Expressions are on-line translated into LLVM code, which is just-in-time executed in parallel over the available CPUs. Functions are automatically parallel, provided you specify '+RTS -Nwhatever' on the command line when running the program.
- data Acc a :: * -> *
- class (Typeable * a, Typeable * (ArrRepr a)) => Arrays a
- run :: Arrays a => Acc a -> a
- runWith :: Arrays a => Native -> Acc a -> a
- run1 :: (Arrays a, Arrays b) => (Acc a -> Acc b) -> a -> b
- run1With :: (Arrays a, Arrays b) => Native -> (Acc a -> Acc b) -> a -> b
- stream :: (Arrays a, Arrays b) => (Acc a -> Acc b) -> [a] -> [b]
- streamWith :: (Arrays a, Arrays b) => Native -> (Acc a -> Acc b) -> [a] -> [b]
- data Async a :: * -> *
- wait :: Async a -> IO a
- poll :: Async a -> IO (Maybe a)
- cancel :: Async a -> IO ()
- runAsync :: Arrays a => Acc a -> IO (Async a)
- runAsyncWith :: Arrays a => Native -> Acc a -> IO (Async a)
- run1Async :: (Arrays a, Arrays b) => (Acc a -> Acc b) -> a -> IO (Async b)
- run1AsyncWith :: (Arrays a, Arrays b) => Native -> (Acc a -> Acc b) -> a -> IO (Async b)
- data Native
- type Strategy = Gang -> Executable
- createTarget :: [Int] -> Strategy -> IO Native
- balancedParIO :: Int -> Strategy
- unbalancedParIO :: Strategy
Documentation
Accelerate is an embedded language that distinguishes between vanilla arrays (e.g. in Haskell memory on the CPU) and embedded arrays (e.g. in device memory on a GPU), as well as the computations on both of these. Since Accelerate is an embedded language, programs written in Accelerate are not compiled by the Haskell compiler (GHC). Rather, each Accelerate backend is a runtime compiler which generates and executes parallel SIMD code of the target language at application runtime.
The type constructor Acc
represents embedded collective array operations.
A term of type Acc a
is an Accelerate program which, once executed, will
produce a value of type a
(an Array
or a tuple of Arrays
). Collective
operations of type Acc a
comprise many scalar expressions, wrapped in
type constructor Exp
, which will be executed in parallel. Although
collective operations comprise many scalar operations executed in parallel,
scalar operations cannot initiate new collective operations: this
stratification between scalar operations in Exp
and array operations in
Acc
helps statically exclude nested data parallelism, which is difficult
to execute efficiently on constrained hardware such as GPUs.
For example, to compute a vector dot product we could write:
dotp :: Num a => Vector a -> Vector a -> Acc (Scalar a) dotp xs ys = let xs' = use xs ys' = use ys in fold (+) 0 ( zipWith (*) xs' ys' )
The function dotp
consumes two one-dimensional arrays (Vector
s) of
values, and produces a single (Scalar
) result as output. As the return type
is wrapped in the type Acc
, we see that it is an embedded Accelerate
computation - it will be evaluated in the object language of dynamically
generated parallel code, rather than the meta language of vanilla Haskell.
As the arguments to dotp
are plain Haskell arrays, to make these available
to Accelerate computations they must be embedded with the
use
function.
An Accelerate backend is used to evaluate the embedded computation and return
the result back to vanilla Haskell. Calling the run
function of a backend
will generate code for the target architecture, compile, and execute it. For
example, the following backends are available:
- accelerate-llvm-native: for execution on multicore CPUs
- accelerate-llvm-ptx: for execution on NVIDIA CUDA-capable GPUs
See also Exp
, which encapsulates embedded scalar computations.
- Fusion:
Array computations of type Acc
will be subject to array fusion;
Accelerate will combine individual Acc
computations into a single
computation, which reduces the number of traversals over the input data and
thus improves performance. As such, it is often useful to have some intuition
on when fusion should occur.
The main idea is to first partition array operations into two categories:
- Element-wise operations, such as
map
,generate
, andbackpermute
. Each element of these operations can be computed independently of all others. - Collective operations such as
fold
,scanl
, andstencil
. To compute each output element of these operations requires reading multiple elements from the input array(s).
Element-wise operations fuse together whenever the consumer operation uses a single element of the input array. Element-wise operations can both fuse their inputs into themselves, as well be fused into later operations. Both these examples should fuse into a single loop:
map -> reverse -> reshape -> map -> map
map -> backpermute -> zipWith -> map generate ->
If the consumer operation uses more than one element of the input array
(typically, via generate
indexing an array multiple
times), then the input array will be completely evaluated first; no fusion
occurs in this case, because fusing the first operation into the second
implies duplicating work.
On the other hand, collective operations can fuse their input arrays into themselves, but on output always evaluate to an array; collective operations will not be fused into a later step. For example:
use -> zipWith -> fold |-> map generate ->
Here the element-wise sequence (use
+ generate
+ zipWith
) will
fuse into a single operation, which then fuses into the collective
fold
operation. At this point in the program the
fold
must now be evaluated. In the final step the
map
reads in the array produced by
fold
. As there is no fusion between the
fold
and map
steps, this
program consists of two "loops"; one for the use
+ generate
+ zipWith
+ fold
step, and one for the final
map
step.
You can see how many operations will be executed in the fused program by
Show
-ing the Acc
program, or by using the debugging option -ddump-dot
to save the program as a graphviz DOT file.
As a special note, the operations unzip
and
reshape
, when applied to a real array, are executed
in constant time, so in this situation these operations will not be fused.
- Tips:
- Since
Acc
represents embedded computations that will only be executed when evaluated by a backend, we can programatically generate these computations using the meta language Haskell; for example, unrolling loops or embedding input values into the generated code. - It is usually best to keep all intermediate computations in
Acc
, and onlyrun
the computation at the very end to produce the final result. This enables optimisations between intermediate results (e.g. array fusion) and, if the target architecture has a separate memory space as is the case of GPUs, to prevent excessive data transfers.
class (Typeable * a, Typeable * (ArrRepr a)) => Arrays a #
Arrays
consists of nested tuples of individual Array
s, currently up to
15-elements wide. Accelerate computations can thereby return multiple
results.
arrays, flavour, toArr, fromArr
Arrays () | |
(Arrays a, Arrays b) => Arrays (a, b) | |
(Shape sh, Elt e) => Arrays (Array sh e) | |
(Arrays a, Arrays b, Arrays c) => Arrays (a, b, c) | |
(Arrays a, Arrays b, Arrays c, Arrays d) => Arrays (a, b, c, d) | |
(Arrays a, Arrays b, Arrays c, Arrays d, Arrays e) => Arrays (a, b, c, d, e) | |
(Arrays a, Arrays b, Arrays c, Arrays d, Arrays e, Arrays f) => Arrays (a, b, c, d, e, f) | |
(Arrays a, Arrays b, Arrays c, Arrays d, Arrays e, Arrays f, Arrays g) => Arrays (a, b, c, d, e, f, g) | |
(Arrays a, Arrays b, Arrays c, Arrays d, Arrays e, Arrays f, Arrays g, Arrays h) => Arrays (a, b, c, d, e, f, g, h) | |
(Arrays a, Arrays b, Arrays c, Arrays d, Arrays e, Arrays f, Arrays g, Arrays h, Arrays i) => Arrays (a, b, c, d, e, f, g, h, i) | |
(Arrays a, Arrays b, Arrays c, Arrays d, Arrays e, Arrays f, Arrays g, Arrays h, Arrays i, Arrays j) => Arrays (a, b, c, d, e, f, g, h, i, j) | |
(Arrays a, Arrays b, Arrays c, Arrays d, Arrays e, Arrays f, Arrays g, Arrays h, Arrays i, Arrays j, Arrays k) => Arrays (a, b, c, d, e, f, g, h, i, j, k) | |
(Arrays a, Arrays b, Arrays c, Arrays d, Arrays e, Arrays f, Arrays g, Arrays h, Arrays i, Arrays j, Arrays k, Arrays l) => Arrays (a, b, c, d, e, f, g, h, i, j, k, l) | |
(Arrays a, Arrays b, Arrays c, Arrays d, Arrays e, Arrays f, Arrays g, Arrays h, Arrays i, Arrays j, Arrays k, Arrays l, Arrays m) => Arrays (a, b, c, d, e, f, g, h, i, j, k, l, m) | |
(Arrays a, Arrays b, Arrays c, Arrays d, Arrays e, Arrays f, Arrays g, Arrays h, Arrays i, Arrays j, Arrays k, Arrays l, Arrays m, Arrays n) => Arrays (a, b, c, d, e, f, g, h, i, j, k, l, m, n) | |
(Arrays a, Arrays b, Arrays c, Arrays d, Arrays e, Arrays f, Arrays g, Arrays h, Arrays i, Arrays j, Arrays k, Arrays l, Arrays m, Arrays n, Arrays o) => Arrays (a, b, c, d, e, f, g, h, i, j, k, l, m, n, o) | |
Synchronous execution
run :: Arrays a => Acc a -> a Source #
Compile and run a complete embedded array program.
NOTE: it is recommended to use run1
whenever possible.
runWith :: Arrays a => Native -> Acc a -> a Source #
As run
, but execute using the specified target (thread gang).
run1 :: (Arrays a, Arrays b) => (Acc a -> Acc b) -> a -> b Source #
Prepare and execute an embedded array program of one argument.
This function can be used to improve performance in cases where the array
program is constant between invocations, because it enables us to bypass
front-end conversion stages and move directly to the execution phase. If you
have a computation applied repeatedly to different input data, use this,
specifying any changing aspects of the computation via the input parameter.
If the function is only evaluated once, this is equivalent to run
.
To use run1
effectively you must express your program as a function of one
argument. If your program takes more than one argument, you can use
lift
and unlift
to tuple up
the arguments.
At an example, once your program is expressed as a function of one argument, instead of the usual:
step :: Acc (Vector a) -> Acc (Vector b) step = ... simulate :: Vector a -> Vector b simulate xs = run $ step (use xs)
Instead write:
simulate xs = run1 step xs
You can use the debugging options to check whether this is working
successfully by, for example, observing no output from the -ddump-cc
flag
at the second and subsequent invocations.
See the programs in the 'accelerate-examples' package for examples.
run1With :: (Arrays a, Arrays b) => Native -> (Acc a -> Acc b) -> a -> b Source #
As run1
, but execute using the specified target (thread gang).
stream :: (Arrays a, Arrays b) => (Acc a -> Acc b) -> [a] -> [b] Source #
Stream a lazily read list of input arrays through the given program, collecting results as we go.
streamWith :: (Arrays a, Arrays b) => Native -> (Acc a -> Acc b) -> [a] -> [b] Source #
As stream
, but execute using the specified target (thread gang).
Asynchronous execution
Block the calling thread until the computation completes, then return the result.
poll :: Async a -> IO (Maybe a) #
Test whether the asynchronous computation has already completed. If so,
return the result, else Nothing
.
runAsyncWith :: Arrays a => Native -> Acc a -> IO (Async a) Source #
As runAsync
, but execute using the specified target (thread gang).
run1Async :: (Arrays a, Arrays b) => (Acc a -> Acc b) -> a -> IO (Async b) Source #
As run1
, but execute asynchronously.
run1AsyncWith :: (Arrays a, Arrays b) => Native -> (Acc a -> Acc b) -> a -> IO (Async b) Source #
As run1Async
, but execute using the specified target (thread gang).
Execution targets
Native machine code JIT execution target
type Strategy = Gang -> Executable Source #
The strategy for balancing work amongst the available worker threads.
:: [Int] | CPU IDs to launch worker threads on |
-> Strategy | Strategy to balance parallel workloads |
-> IO Native |
Create a Native execution target by spawning a worker thread on each of the given capabilities, and using the given strategy to load balance the workers when executing parallel operations.
Execute a computation where threads use work stealing (based on lazy splitting of work stealing queues and exponential backoff) in order to automatically balance the workload amongst themselves.
unbalancedParIO :: Strategy Source #
Execute a computation without load balancing. Each thread computes an equally sized chunk of the input. No work stealing occurs.