> {-# LANGUAGE ViewPatterns #-} A parallel implementation of Lloyd's algorithm for k-means clustering, adapted from Marlow's _Parallel and Concurrent Programming in Haskell_. Here we use Evaluation Strategies to parallelise the assignment of points to clusters: > module Algorithms.Lloyd.Strategies ( > Point(..), > Cluster(..), > ExpectDivergent(..), > Partitions(..), > kmeans, > step > ) where > > import Prelude hiding (zipWith, foldr1, map) > import Control.Parallel.Strategies (Strategy(..), parTraversable, using, rseq) > import Data.Foldable (Foldable(foldr1)) > import Data.Functor.Extras ((..:)) > import Data.Metric (Metric(..)) > import Data.Semigroup (Semigroup(..)) > import Data.Vector (Vector(..), zipWith, map) > import Data.Vector.Split (chunkInto) > import Algorithms.Lloyd.Sequential (Cluster(..), Point(..), ExpectDivergent(..), PointSum(..), makeNewClusters, assignPS, assign) We can combine two vectors of some same type $t$ provided we know how to combine two $t$s: > (<><>) :: Semigroup a => Vector a -> Vector a -> Vector a > (<><>) = zipWith (<>) (We can't make $Vector a$ an instance of $Semigroup$, because it already is. But that instance has $(<>) = (++)$, so we can't use it here.) Step is modified to, given a partitioned list of points, perform classification in parallel: > step :: Metric a => (Vector Double -> a) -> Vector Cluster -> Vector (Vector Point) -> Vector Cluster > step = makeNewClusters . foldr1 (<><>) . with (parTraversable rseq) ..: fmap ..: assignPS > > with :: Strategy a -> a -> a > with = flip using This version of k-means takes an additional arguments -- the number of partitions the set of points'll be divided into. This needn't equal the number of processors: if there are more spark than cores, the runtime can be trusted to schedule unallocated sparks so soon as a core becomes available. That said: if there are too many work items, the overhead of recombination may exceed the speed-up provided by parallellism; if there are too few items, and those items vary in cost, some of our cores may be unused for part of the computation. > newtype Partitions = Partitions { partitions :: Int } > > computeClusters :: Metric a => ExpectDivergent -> (Vector Double -> a) -> Partitions -> Vector Point -> Vector Cluster -> Vector Cluster > computeClusters (expectDivergent -> expectDivergent) metric = computeClusters' expectDivergent metric 0 ..: chunkInto . partitions > > computeClusters' :: Metric a => Int -> (Vector Double -> a) -> Int -> Vector (Vector Point) -> Vector Cluster -> Vector Cluster > computeClusters' expectDivergent metric iterations points clusters > | iterations >= expectDivergent = clusters > | clusters' == clusters = clusters > | otherwise = computeClusters' expectDivergent metric (succ iterations) points clusters' > where clusters' = step metric clusters points > > kmeans :: Metric a => ExpectDivergent -> (Vector Double -> a) -> Partitions -> Vector Point -> Vector Cluster -> Vector (Vector Point) > kmeans expectDivergent metric chunks points initial = assign metric clusters points > where clusters = computeClusters expectDivergent metric chunks points initial