futhark-0.9.1: An optimising compiler for a functional, array-oriented language.

Safe Haskell: None
Language: Haskell2010

Futhark.Optimise.TileLoops.RegTiling3D

Description

Perform a restricted form of register tiling corresponding to the following pattern:

* a stream is perfectly nested inside a kernel with at least three parallel dimensions (the perfect-nesting restriction can be relaxed a bit);
* all streamed arrays are one-dimensional;
* all streamed arrays are variant to exactly one of the three innermost parallel dimensions, and conversely, for each of the three innermost parallel dimensions, there is at least one streamed array variant to it;
* the stream's result is a tuple of scalar values, which are also the "thread-in-space" return of the kernel.

Target code can be found in "tests/reg-tiling/reg-tiling-3d.fut".
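The variance requirement in the pattern above can be modelled as a small predicate. The following is an illustrative Python sketch, not the Haskell code of this module: the function name and the representation of variance (a mapping from each streamed array to the set of kernel dimensions it depends on) are assumptions made for the example.

```python
def satisfies_variance_condition(variance, innermost_dims):
    """Check the variance requirement of the tiling pattern.

    variance: dict mapping a streamed array name to the set of kernel
              dimensions it is variant to (hypothetical representation).
    innermost_dims: the three innermost parallel dimensions of the kernel.
    """
    dims = set(innermost_dims)
    if len(dims) != 3:
        return False
    covered = set()
    for depends_on in variance.values():
        hits = set(depends_on) & dims
        # Each streamed array must vary with exactly one innermost dimension.
        if len(hits) != 1:
            return False
        covered |= hits
    # Conversely, each innermost dimension needs at least one array.
    return covered == dims


dims = ("gtid_z", "gtid_y", "gtid_x")
ok = satisfies_variance_condition(
    {"flags": {"gtid_z"}, "ass": {"gtid_y"}, "bss": {"gtid_x"}}, dims)
```

Here `ok` holds for the one-array-per-dimension case; a set of arrays that leaves one dimension uncovered, or an array that varies with two of the dimensions, is rejected.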

Documentation

doRegTiling3D :: Stm Kernels -> TileM (Maybe (Stms Kernels, Stm Kernels))

Expects a kernel statement as argument. CONDITIONS for the 3D tiling optimization to fire are:

1. a) The kernel body can be broken into scalar-code-1 ++ [GroupStream stmt] ++ scalar-code-2.
   b) The kernel has a "ThreadsReturn ThreadsInSpace" result, and obviously the result is variant to the 3rd dimension (counted from innermost to outermost).
2. For the GroupStream (morally StreamSeq):
   a) the arrays' outer size must equal the maximal chunk size;
   b) the streamed arrays are one-dimensional;
   c) each of the array arguments of GroupStream is variant to exactly one of the three innermost parallel dimensions of the kernel. This condition can be relaxed by interchanging kernel dimensions whenever possible.
3. For scalar-code-1:
   a) each of the statements is a slice that produces one of the streamed arrays.
4. For simplicity assume scalar-code-2 is empty! (To be extended later.)

ASSUME the initial kernel is (as in tests/reg-tiling/reg-tiling-3d.fut):

    kernel map(num groups: num_groups, group size: group_size,
               num threads: num_threads, global TID -> global_tid,
               local TID -> local_tid, group ID -> group_id)
              (gtid_z < size_z, gtid_y < size_xy, gtid_x < size_xy) : {f32} {
      let {[size_com]f32 flags} = empty_or_match_cert_6685fss_6664[gtid_z, 0i32:+size_com*1i32]
      let {[size_com]f32 ass} = ass_6662[gtid_y, 0i32:+size_com*1i32]
      let {[size_com]f32 bss} = res_6687[gtid_x, 0i32:+size_com*1i32]
      let {f32 res_ker} =
        stream(size_com, size_com,
               fn (int chunk_size_out, int chunk_offset_6736, f32 acc_out,
                   [chunk_size_out]f32 flags_chunk_out,
                   [chunk_size_out]f32 ass_chunk_out,
                   [chunk_size_out]f32 bss_chunk_out) =>
                 let {f32 res_out} =
                   stream(chunk_size_out, 1i32,
                          fn (int chunk_size_in, int i_6743, f32 acc_in,
                              [chunk_size_in]f32 flags_chunk_in,
                              [chunk_size_in]f32 ass_chunk_in,
                              [chunk_size_in]f32 bss_chunk_in) =>
                            let {f32 f} = flags_chunk_in[0i32]
                            let {f32 a} = ass_chunk_in[0i32]
                            let {f32 b} = bss_chunk_in[0i32]
                            let {bool cond} = lt32(f, 9.0f32)
                            let {f32 tmp} =
                              if cond
                              then { let {f32 tmp1} = fmul32(a, b)
                                     in {tmp1} }
                              else {0.0f32}
                            let {f32 res_in} = fadd32(acc_in, tmp)
                            in {res_in},
                          {acc_out},
                          flags_chunk_out, ass_chunk_out, bss_chunk_out)
                 in {res_out},
               {0.0f32},
               flags, ass, bss)
      return {thread in space returns res_ker}
    }
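To make the kernel above easier to follow, here is a rough sequential model in Python of what each thread computes, together with one possible register-tiled restructuring. This is only a sketch under stated assumptions: the function names, the `tile_z` parameter, and the choice to tile along the z dimension are illustrative and are not taken from the code this module generates.

```python
def kernel_reference(fss, ass, bss, size_com):
    """Direct model: one 'thread' per (z, y, x), each running the full stream."""
    size_z, size_y, size_x = len(fss), len(ass), len(bss)
    res = [[[0.0] * size_x for _ in range(size_y)] for _ in range(size_z)]
    for z in range(size_z):
        for y in range(size_y):
            for x in range(size_x):
                acc = 0.0
                for k in range(size_com):
                    f, a, b = fss[z][k], ass[y][k], bss[x][k]
                    acc += a * b if f < 9.0 else 0.0
                res[z][y][x] = acc
    return res


def kernel_register_tiled(fss, ass, bss, size_com, tile_z=2):
    """Tiled model (assumed scheme): each (y, x) 'thread' keeps up to tile_z
    accumulators and reuses a single read of a and b across the z-tile."""
    size_z, size_y, size_x = len(fss), len(ass), len(bss)
    res = [[[0.0] * size_x for _ in range(size_y)] for _ in range(size_z)]
    for z0 in range(0, size_z, tile_z):
        zs = range(z0, min(z0 + tile_z, size_z))
        for y in range(size_y):
            for x in range(size_x):
                accs = [0.0] * len(zs)  # stands in for per-thread registers
                for k in range(size_com):
                    a, b = ass[y][k], bss[x][k]  # read once per tile
                    for i, z in enumerate(zs):
                        accs[i] += a * b if fss[z][k] < 9.0 else 0.0
                for i, z in enumerate(zs):
                    res[z][y][x] = accs[i]
    return res


fss = [[1.0, 10.0], [1.0, 1.0], [10.0, 10.0]]
ass = [[1.0, 2.0], [3.0, 4.0]]
bss = [[5.0, 6.0]]
reference = kernel_reference(fss, ass, bss, 2)
tiled = kernel_register_tiled(fss, ass, bss, 2, tile_z=2)
```

Both functions accumulate over `k` in the same order for every output element, so `tiled` equals `reference` exactly; the tiled version simply reads each `a` and `b` once per z-tile instead of once per output element.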