Haskell Distributed Parallel Haskell
Haskell distributed parallel Haskell (HdpH) is a Haskell DSL for
parallel computation on distributed-memory architectures. HdpH is
implemented entirely in Haskell but does make use of a few GHC extensions,
most notably TemplateHaskell.
HdpH is described in some detail in the paper Implementing a High-level Distributed-Memory Parallel Haskell in Haskell.
The paper on The Design of SymGridPar2 paints the bigger picture of what HdpH was designed for and where future developments will lead.
This release is considered alpha stage.
Building HdpH
Should be straightforward from the cabalised package hdph
.
Note that hdph
depends on hdph-closure
, an independent package that
factors out the closure representation of HdpH.
Running HdpH
Launch via ssh
HdpH comes with a few sample applications. The simplest one,
a distributed Hello World, is good for testing whether HdpH works.
For example, the following Bourne shell command will run Hello World
on three nodes, distributed over two hosts, bwlf01
and bwlf02
.
$> for host in bwlf01 bwlf02 bwlf01; do ssh $host hdph/dist/build/hello/hello -numProcs=3 & done
This will launch three instances of the Hello World binary hello
(in
directory hdph/dist/build/hello
relative to the user's home directory),
two on bwlf01
and one on bwlf02
.
Note that the obligatory option -numProcs
must match the number of instances;
HdpH will likely hang otherwise.
Assuming that ssh
has been set up to enable password-less logins, the output
should be something like this:
Master Node:137.195.143.101:19754:0 wants to know: Who is here?
Hello from Node:137.195.143.101:19754:0
Hello from Node:137.195.143.102:38939:0
Hello from Node:137.195.143.101:30062:0
Launch via mpiexec
Shell scripts and ssh
are quite a cumbersome way of launching HdpH.
More convenient are dedicated parallel job launchers such as mpiexec
,
which comes with any recent MPI distribution.
Even though this version of HdpH does not use MPI, it can still be launched
by mpiexec
.
An MPICH installation, for example, might launch the Hello World application
with the following command:
$> mpiexec -hosts bwlf01,bwlf02,bwlf01 ./hello -numProcs=3
This will launch the Hello World binary hello
(in the current working
directory) 3 times, twice on bwlf01
and once on bwlf02
.
The expected output should look somewhat like this:
Hello from Node:137.195.143.102:12701:0
Hello from Node:137.195.143.101:15353:0
Master Node:137.195.143.101:14247:0 wants to know: Who is here?
Hello from Node:137.195.143.101:14247:0
Parallel launch via mpiexec
To test parallelism, there is a sample application computing the n
th number
of the Fibonacci sequence using a naive parallel divide-and-conquer algorithm.
For example, the following command will compute the 45th Fibonacci number in
parallel on 3 nodes, switching to sequential execution for subproblems below
a threshold of 30; a wall clock runtime of about 16 seconds is reported here.
$> mpiexec -hosts bwlf01,bwlf02,bwlf03 ./fib -numProcs=3 v2 45 30
{v2, seqThreshold=30, parThreshold=30} fib 45 = 1836311903 {runtime=16.157814s}
When passed the switch -d1
HdpH will produce some summary output per
node relating to parallelism, eg. number of sparks generated per node,
maximum number of sparks residing on a node, number of times a node
requested work, and number of times work was scheduled in response.
An example run with -d1
might look like this:
$> mpiexec -hosts bwlf01,bwlf02,bwlf03 ./fib -numProcs=3 -d1 v2 45 30
137.195.143.101:9759:0 #SPARK=1017 max_SPARK=154 max_THREAD=[3,1]
137.195.143.101:9759:0 #FISH_sent=1 #SCHED_rcvd=0
{v2, seqThreshold=30, parThreshold=30} fib 45 = 1836311903 {runtime=16.449183s}
137.195.143.102:12551:0 #SPARK=308 max_SPARK=4 max_THREAD=[1,1]
137.195.143.102:12551:0 #FISH_sent=215 #SCHED_rcvd=214
137.195.143.103:27402:0 #SPARK=271 max_SPARK=4 max_THREAD=[0,1]
137.195.143.103:27402:0 #FISH_sent=225 #SCHED_rcvd=224
This shows that bwlf01
(IP address 137.195.143.101
) was the master node,
generating 1017 sparks in total, requesting work once (probably at the end
of the computation) and not receiving any.
The other nodes did generate some sparks themselves (308 and 271, respectively)
but they also both received a substantial number of sparks (214 and 224,
respectively).
Most importantly, each node requested work exactly once more often than
receiving it, which means no node was ever idling, except probably at the
very end of the computation.
Parallel launch via mpiexec
on homogeneous multicores
So far HdpH ran in a single thread on each node.
To make use of multicores mpiexec
can place several single-threaded nodes
on the same host, eg. the following examples will launch two and four nodes per
host, respectively; note that the number following the -n
switch of mpiexec
must match the number following the -numProcs
switch of fib
.
$> mpiexec -hosts bwlf01,bwlf02,bwlf03 -n 6 ./fib -numProcs=6 v2 45 30
{v2, seqThreshold=30, parThreshold=30} fib 45 = 1836311903 {runtime=8.026417s}
$> mpiexec -hosts bwlf01,bwlf02,bwlf03 -n 12 ./fib -numProcs=12 v2 45 30
{v2, seqThreshold=30, parThreshold=30} fib 45 = 1836311903 {runtime=4.845796s}
The other possibility is to run HdpH itself in a multi-threaded mode.
The following two examples run HdpH on two and four threads per node,
respectively, expecting that the OS will bind each thread to a core.
$> mpiexec -hosts bwlf01,bwlf02,bwlf03 ./fib -numProcs=3 -scheds=2 v2 45 30 +RTS -N2
{v2, seqThreshold=30, parThreshold=30} fib 45 = 1836311903 {runtime=10.648305s}
$> mpiexec -hosts bwlf01,bwlf02,bwlf03 ./fib -numProcs=3 -scheds=4 v2 45 30 +RTS -N4
{v2, seqThreshold=30, parThreshold=30} fib 45 = 1836311903 {runtime=6.507969s}
Note that the GHC RTS switch -N
determines the number of threads (or
HECs in GHC terminology) whereas the fib
switch -scheds
determines
how many of these threads run the HdpH scheduler loop. Whether there should
be as many schedulers as threads or less depends on the GHC version and
probably on the OS. Using one scheduler less than the number of threads
has been observed to reduce variability, and sometimes even bring
performance gains:
$> mpiexec -hosts bwlf01,bwlf02,bwlf03 ./fib -numProcs=3 -scheds=3 v2 45 30 +RTS -N4
{v2, seqThreshold=30, parThreshold=30} fib 45 = 1836311903 {runtime=5.831932s}
Parallel launch via mpiexec
on heterogeneous multicores
So far all hosts were assumed to have the same number of cores.
However, mpiexec
can be used to launch on heterogeneous clusters.
For instance, the following call will launch HdpH with two threads each on
bwlf01
and bwlf02
, and with four threads on bwlf03
; please see the man
page of mpiexec
for an explanation of the command line format.
$> mpiexec -hosts bwlf01,bwlf02,bwlf03 -n 2 ./fib -numProcs=3 -scheds=2 -d1 v2 45 30 +RTS -N2 : -n 1 ./fib -numProcs=3 -scheds=4 -d1 v2 45 30 +RTS -N4
137.195.143.101:21394:0 #SPARK=1114 max_SPARK=183 max_THREAD=[3,1,1]
137.195.143.101:21394:0 #FISH_sent=3 #SCHED_rcvd=1
137.195.143.103:22027:0 #SPARK=261 max_SPARK=2 max_THREAD=[0,1,1,1,1]
137.195.143.103:22027:0 #FISH_sent=209 #SCHED_rcvd=208
137.195.143.102:22374:0 #SPARK=221 max_SPARK=3 max_THREAD=[1,1,1]
137.195.143.102:22374:0 #FISH_sent=173 #SCHED_rcvd=172
{v2, seqThreshold=30, parThreshold=30} fib 45 = 1836311903 {runtime=10.502205s}
The switch -d1
reveals that bwlf03
(IP address 137.195.143.103
) did
indeed run 4 schedulers because its max_THREAD
list was length 5, and
the length of the max_THREAD
list is always 1 plus the number of schedulers.
Launch caveats
At startup HdpH nodes open random ports and rely on UDP multicasts to discover
each other. This results in a number of limitations:
-
UDP multicasts must be routed between all nodes (which may imply that all
hosts must reside in the same subnet).
-
The obligatory -numProcs
switch, which tells an application how many
nodes to expect, must match exactly the number of nodes launched.
-
Node discovery must be completed within a certain time frame (typically 10
seconds).
-
No second HdpH application may be launched during the node discovery phase.
Any deviation from the above limitations is likely to cause HdpH to hang
forever.
References
-
Patrick Maier, Phil Trinder.
Implementing a High-level Distributed-Memory Parallel Haskell in Haskell.
Proc. 2011 Symposium on Implementation and Application of Functional Languages (IFL 2011), Springer LNCS 7257, pp. 35-50.
-
Patrick Maier, Rob Stewart, Phil Trinder.
Reliable Scalable Symbolic Computation: The Design of SymGridPar2.
Proc. 28th ACM Symposium On Applied Computing (SAC 2013), pp. 1677-1684.
-
HdpH development repository on github.