--- layout: post title: KMeans Clustering II date: 2013-02-17 16:12 author: Ranjit Jhala published: true comments: true tags: - basic - measures demo: KMeans.hs --- **The story so far:** [Previously][kmeansI] we saw - how to encode `n`-dimensional points using plain old Haskell lists, - how to encode a matrix with `r` rows and `c` columns as a list of lists, - some basic operations on points and matrices via list-manipulating functions More importantly, we saw how easy it was to encode dimensionality with refinements over the `len` measure, thereby allowing LiquidHaskell to precisely track the dimensions across the various operations. Next, lets use the basic types and operations to develop the actual *KMeans clustering* algorithm, and, along the way, see how LiquidHaskell lets us write precise module contracts which will ward against various unpleasant *lumpenexceptions*.
30: {-# LANGUAGE ScopedTypeVariables, TypeSynonymInstances, FlexibleInstances #-}
31: 
32: module KMeans (kmeans, kmeansGen) where
33: 
34: import KMeansHelper
35: import Prelude              hiding      (zipWith)
36: import Data.List                        (sort, span, minimumBy)
37: import Data.Function                    (on)
38: import Data.Ord                         (comparing)
39: import Language.Haskell.Liquid.Prelude  (liquidAssert, liquidError)
40: 
41: instance (GHC.Classes.Eq (KMeans.WrapType [GHC.Types.Double] a))Eq (WrapType [Double] a) where
42:    (==) = x:{VV : [{VV : (GHC.Types.Double) | false}] | false}
-> y:{VV : [{VV : (GHC.Types.Double) | false}] | false}
-> {VV : (GHC.Types.Bool) | ((? Prop([VV])) <=> (x = y))}(==) ({VV : [{VV : (GHC.Types.Double) | false}] | false}
 -> {VV : [{VV : (GHC.Types.Double) | false}] | false}
 -> {VV : (GHC.Types.Bool) | false})
-> ({VV : (KMeans.WrapType {VV : [{VV : (GHC.Types.Double) | false}] | false} {VV : a | false}) | false}
    -> {VV : [{VV : (GHC.Types.Double) | false}] | false})
-> {VV : (KMeans.WrapType {VV : [{VV : (GHC.Types.Double) | false}] | false} {VV : a | false}) | false}
-> {VV : (KMeans.WrapType {VV : [{VV : (GHC.Types.Double) | false}] | false} {VV : a | false}) | false}
-> {VV : (GHC.Types.Bool) | false}`on` (KMeans.WrapType {VV : [{VV : (GHC.Types.Double) | false}] | false} {VV : a | false})
-> {VV : [{VV : (GHC.Types.Double) | false}] | false}getVect
43: 
44: instance (GHC.Classes.Ord (KMeans.WrapType [GHC.Types.Double] a))Ord (WrapType [Double] a) where
45:     compare = (GHC.Classes.Ord [GHC.Types.Double])comparing (KMeans.WrapType {VV : [{VV : (GHC.Types.Double) | false}] | false} {VV : a | false})
-> {VV : [{VV : (GHC.Types.Double) | false}] | false}getVect
Recall that we are using a modified version of an [existing KMeans implementation][URL-kmeans]. While not the swiftest implementation, it serves as a nice introduction to the algorithm, and the general invariants carry over to more sophisticated implementations. A Quick Recap ------------- Before embarking on the journey, lets remind ourselves of our destination: the goal of [K-Means clustering](http://en.wikipedia.org/wiki/K-means_clustering) is - **Take as Input** : A set of *points* represented by *n-dimensional points* in *Euclidian* space - **Return as Ouptut** : A partitioning of the points, into upto K clusters, in a manner that minimizes the sum of distances between each point and its cluster center. Last time, we introduced a variety of refinement type aliases for Haskell lists **Fixed Length Lists**
66: type List a N = {v : [a] | (len v) = N}
**Non-empty Lists**
70: type NonEmptyList a = {v : [a] | (len v) > 0}
**N-Dimensional Points**
74: type Point N = List Double N
**Matrices**
78: type Matrix a Rows Cols = List (List a Cols) Rows
We also saw several basic list operations
82: groupBy   :: (a -> a -> Bool) -> [a] -> (Clustering a)
83: partition :: PosInt -> [a] -> (Clustering a)
84: zipWith   :: (a -> b -> c) -> xs:[a] -> (List b (len xs)) -> (List c (len xs))
85: transpose :: c:Int -> r:PosInt -> Matrix a r c -> Matrix a c r
whose types will prove essential in order to verify the invariants of the clustering algorithm. You might open the [previous episode][kmeansI] in a separate tab to keep those functions handy, but fear not, we will refresh our memory about them when we get around to using them below. Generalized Points ------------------ To be more flexible, we will support *arbitrary* points as long as they can be **projected** to Euclidian space. In addition to supporting, say, an image or a [cat video][maru] as a point, this will allow us to *weight* different dimensions to different degrees. We represent generalized points with a record
104: data WrapType b a = WrapType {(KMeans.WrapType a b) -> agetVect :: b, (KMeans.WrapType a b) -> bgetVal :: a}
and we can define an alias that captures the dimensionality of the point
110: {-@ type GenPoint a N  = WrapType (Point N) a @-}
That is, `GenPoint a N` denotes a general `a` value that has an `N`-dimensional projection into Euclidean space. Algorithm: Iterative Clustering ------------------------------- Terrific, now that all the pieces are in place lets look at the KMeans algorithm. We have implemented a function `kmeans'`, which takes as input a dimension `n`, the maximum number of clusters `k` (which must be positive), a list of *generalized points* of dimension `n`, and returns a `Clustering` (i.e. a list of *non-empty lists*) of the generalized points. So much verbiage -- a type is worth a thousand comments!
128: {-@ kmeans' :: n:Int
129:             -> k:PosInt
130:             -> points:[(GenPoint a n)]
131:             -> (Clustering (GenPoint a n)) @-}
There! Crisp and to the point. Sorry. Anyhoo, the function implements the above type.
138: n:(GHC.Types.Int)
-> {VV : (GHC.Types.Int) | (VV > 0)}
-> [(KMeans.WrapType {VV : [(GHC.Types.Double)] | (len([VV]) = n)} a)]
-> [{v : [(KMeans.WrapType {VV : [(GHC.Types.Double)] | (len([VV]) = n)} a)] | (len([v]) > 0)}]kmeans' (GHC.Types.Int)n {VV : (GHC.Types.Int) | (VV > 0)}k [(KMeans.WrapType {VV : [(GHC.Types.Double)] | (len([VV]) = n)} a)]points    = (GHC.Classes.Eq [[KMeans.WrapType [GHC.Types.Double] a]])fixpoint (n:(GHC.Types.Int)
-> [{v : [(KMeans.WrapType {VV : [(GHC.Types.Double)] | (len([VV]) = n)} a)] | (len([v]) > 0)}]
-> [{v : [(KMeans.WrapType {VV : [(GHC.Types.Double)] | (len([VV]) = n)} a)] | (len([v]) > 0)}]refineCluster {VV : (GHC.Types.Int) | (VV = n)}n) {VV : [{VV : [(KMeans.WrapType {VV : [(GHC.Types.Double)] | (n = len([VV])),
                                                            (len([VV]) = n)} a)] | (len([VV]) > 0)}] | (VV = initialClustering),
                                                                                                       (len([VV]) >= 0)}initialClustering
139:   where
140:     [{VV : [(KMeans.WrapType {VV : [(GHC.Types.Double)] | (n = len([VV])),
                                                      (len([VV]) = n)} a)] | (len([VV]) > 0)}]initialClustering = {VV : (GHC.Types.Int) | (VV > 0)}
-> [(KMeans.WrapType {VV : [(GHC.Types.Double)] | (n = len([VV])),
                                                  (len([VV]) = n)} a)]
-> [{VV : [(KMeans.WrapType {VV : [(GHC.Types.Double)] | (n = len([VV])),
                                                         (len([VV]) = n)} a)] | (len([VV]) > 0)}]partition {VV : (GHC.Types.Int) | (VV = clusterSize)}clusterSize {VV : [(KMeans.WrapType {VV : [(GHC.Types.Double)] | (len([VV]) = n)} a)] | (VV = points),
                                                                            (len([VV]) >= 0)}points
141:     (GHC.Types.Int)clusterSize       = x:(GHC.Types.Int)
-> y:(GHC.Types.Int)
-> {VV : (GHC.Types.Int) | (VV = ((x > y) ? x : y))}max {VV : (GHC.Types.Int) | (VV = (1  :  int))}1 ((GHC.Types.Int)(xs:[(KMeans.WrapType {VV : [(GHC.Types.Double)] | (n = len([VV])),
                                                  (len([VV]) = n)} a)]
-> {VV : (GHC.Types.Int) | (VV = len([xs]))}length {VV : [(KMeans.WrapType {VV : [(GHC.Types.Double)] | (len([VV]) = n)} a)] | (VV = points),
                                                                            (len([VV]) >= 0)}points x:(GHC.Types.Int)
-> y:(GHC.Types.Int) -> {VV : (GHC.Types.Int) | (VV = (x + y))}+ {VV : (GHC.Types.Int) | (VV = k),(VV > 0)}k x:(GHC.Types.Int)
-> y:(GHC.Types.Int) -> {VV : (GHC.Types.Int) | (VV = (x - y))}- {VV : (GHC.Types.Int) | (VV = (1  :  int))}1) x:(GHC.Types.Int)
-> y:(GHC.Types.Int) -> {VV : (GHC.Types.Int) | (VV = (x / y))}`div` {VV : (GHC.Types.Int) | (VV = k),(VV > 0)}k)
142: 
143:     fixpoint          :: (Eq a) => (a -> a) -> a -> a
144:     (GHC.Classes.Eq a) -> (a -> a) -> a -> afixpoint a -> af ax      = if (GHC.Types.Bool)(a -> af {VV : a | (VV = x)}x) x:a
-> y:a -> {VV : (GHC.Types.Bool) | ((? Prop([VV])) <=> (x = y))}== {VV : a | (VV = x)}x then {VV : a | (VV = x)}x else (GHC.Classes.Eq a) -> (a -> a) -> a -> afixpoint a -> af (a -> af {VV : a | (VV = x)}x)
That is, `kmeans'` creates an `initialClustering` by `partition`-ing the `points` into chunks with `clusterSize` elements. Then, it invokes `fixpoint` to *iteratively refine* the initial clustering with `refineCluster` until it converges to a stable clustering that cannot be improved upon. This stable clustering is returned as the output. LiquidHaskell verifies that `kmeans'` adheres to the given signature in two steps. **1. Initial Clustering** First, LiquidHaskell determines from
159: max       :: (Ord a) => x:a -> y:a -> {v: a | (v >= x) && ( v >= y) }
that `clusterSize` is strictly positive, and hence, from
163: partition :: size:PosInt -> [a] -> (Clustering a)
which we saw [last time][kmeansI], that `initialClustering` is indeed a valid `Clustering` of `(GenPoint a n)`. **2. Fixpoint** Next, LiquidHaskell infers that at the call `fixpoint (refineCluster n) ...`, that the type parameter `a` of `fixpoint` can be *instantiated* with `Clustering (GenPoint a n)`. This is because `initialClustering` is a valid clustering, as we saw above, and because `refineCluster` takes -- and returns -- valid `n`-dimensional clusterings, as we shall see below. Consequently, the value returned by `kmeans'` is also `Clustering` of `GenPoint a n` as required. Refining A Clustering --------------------- Thus, the real work in KMeans happens inside `refineCluster`, which takes a clustering and improves it, like so:
186: n:(GHC.Types.Int)
-> [{v : [(KMeans.WrapType {VV : [(GHC.Types.Double)] | (len([VV]) = n)} a)] | (len([v]) > 0)}]
-> [{v : [(KMeans.WrapType {VV : [(GHC.Types.Double)] | (len([VV]) = n)} a)] | (len([v]) > 0)}]refineCluster (GHC.Types.Int)n [{v : [(KMeans.WrapType {VV : [(GHC.Types.Double)] | (len([VV]) = n)} a)] | (len([v]) > 0)}]clusters = {VV : [{VV : [(KMeans.WrapType {VV : [(GHC.Types.Double)] | (n = len([VV])),
                                                            (len([VV]) = n),
                                                            (len([VV]) >= 0)} a)] | (len([VV]) > 0)}] | (VV = clusters'),
                                                                                                        (len([VV]) = len([centeredGroups])),
                                                                                                        (len([VV]) >= 0)}clusters'
187:   where
188:     -- 1. Compute cluster centers
189:     {VV : [{VV : [(GHC.Types.Double)] | (n = len([VV])),
                                    (len([VV]) = n)}] | (len([VV]) = len([clusters]))}centers        = ({VV : [(KMeans.WrapType {VV : [(GHC.Types.Double)] | (n = len([VV])),
                                                      (len([VV]) = n)} a)] | (len([VV]) > 0)}
 -> {VV : [(GHC.Types.Double)] | (n = len([VV])),(len([VV]) = n)})
-> xs:[{VV : [(KMeans.WrapType {VV : [(GHC.Types.Double)] | (n = len([VV])),
                                                            (len([VV]) = n)} a)] | (len([VV]) > 0)}]
-> {VV : [{VV : [(GHC.Types.Double)] | (n = len([VV])),
                                       (len([VV]) = n)}] | (len([VV]) = len([xs]))}map (n:(GHC.Types.Int)
-> {v : [(KMeans.WrapType {VV : [(GHC.Types.Double)] | (len([VV]) = n)} a)] | (len([v]) > 0)}
-> {VV : [(GHC.Types.Double)] | (len([VV]) = n)}clusterCenter {VV : (GHC.Types.Int) | (VV = n)}n) {VV : [{v : [(KMeans.WrapType {VV : [(GHC.Types.Double)] | (len([VV]) = n)} a)] | (len([v]) > 0)}] | (VV = clusters),
                                                                                                     (len([VV]) >= 0)}clusters
190: 
191:     -- 2. Map points to their nearest center
192:     [(KMeans.WrapType {VV : [(GHC.Types.Double)] | (n = len([VV])),
                                               (len([VV]) = n)} a)]points         = [[(KMeans.WrapType {VV : [(GHC.Types.Double)] | (n = len([VV])),
                                                (len([VV]) = n)} a)]]
-> [(KMeans.WrapType {VV : [(GHC.Types.Double)] | (n = len([VV])),
                                                  (len([VV]) = n)} a)]concat {VV : [{v : [(KMeans.WrapType {VV : [(GHC.Types.Double)] | (len([VV]) = n)} a)] | (len([v]) > 0)}] | (VV = clusters),
                                                                                                     (len([VV]) >= 0)}clusters
193:     [({VV : [(GHC.Types.Double)] | (n = len([VV])),
                               (len([VV]) = n),
                               (len([VV]) >= 0)} , (KMeans.WrapType {VV : [(GHC.Types.Double)] | (n = len([VV])),
                                                                                                 (len([VV]) = n),
                                                                                                 (len([VV]) >= 0)} a))]centeredPoints = (GHC.Classes.Ord
  ([GHC.Types.Double], KMeans.WrapType [GHC.Types.Double] a))sort {VV : [({VV : [(GHC.Types.Double)] | (n = len([VV])),
                                     (len([VV]) = n),
                                     (len([VV]) >= 0)} , (KMeans.WrapType {VV : [(GHC.Types.Double)] | (n = len([VV])),
                                                                                                       (len([VV]) = n),
                                                                                                       (len([VV]) >= 0)} a))] | (len([VV]) = len([points])),
                                                                                                                                (len([VV]) >= 0)}[({VV : [(GHC.Types.Double)] | (n = len([VV])),
                              (len([VV]) = n),
                              (len([VV]) >= 0)} , (KMeans.WrapType {VV : [(GHC.Types.Double)] | (n = len([VV])),
                                                                                                (len([VV]) = n),
                                                                                                (len([VV]) >= 0)} a))(n:(GHC.Types.Int)
-> (KMeans.WrapType {VV : [(GHC.Types.Double)] | (len([VV]) = n)} a)
-> [{VV : [(GHC.Types.Double)] | (len([VV]) = n)}]
-> {VV : [(GHC.Types.Double)] | (len([VV]) = n)}nearestCenter {VV : (GHC.Types.Int) | (VV = n)}n (KMeans.WrapType {VV : [(GHC.Types.Double)] | (n = len([VV])),
                                              (len([VV]) = n)} a)p {VV : [{VV : [(GHC.Types.Double)] | (n = len([VV])),
                                    (len([VV]) = n)}] | (VV = centers),
                                                        (len([VV]) = len([clusters])),
                                                        (len([VV]) >= 0)}centers, (KMeans.WrapType {VV : [(GHC.Types.Double)] | (n = len([VV])),
                                              (len([VV]) = n)} a)p) | p <- {VV : [(KMeans.WrapType {VV : [(GHC.Types.Double)] | (n = len([VV])),
                                                     (len([VV]) = n)} a)] | (VV = points),
                                                                            (len([VV]) >= 0)}points]
194: 
195:     -- 3. Group points by nearest center to get new clusters
196:     [{VV : [({VV : [(GHC.Types.Double)] | (n = len([VV])),
                                      (len([VV]) = n),
                                      (len([VV]) >= 0)} , (KMeans.WrapType {VV : [(GHC.Types.Double)] | (n = len([VV])),
                                                                                                        (len([VV]) = n),
                                                                                                        (len([VV]) >= 0)} a))] | (len([VV]) > 0)}]centeredGroups = (({VV : [(GHC.Types.Double)] | (n = len([VV])),
                               (len([VV]) = n),
                               (len([VV]) >= 0)} , (KMeans.WrapType {VV : [(GHC.Types.Double)] | (n = len([VV])),
                                                                                                 (len([VV]) = n),
                                                                                                 (len([VV]) >= 0)} a))
 -> ({VV : [(GHC.Types.Double)] | (n = len([VV])),
                                  (len([VV]) = n),
                                  (len([VV]) >= 0)} , (KMeans.WrapType {VV : [(GHC.Types.Double)] | (n = len([VV])),
                                                                                                    (len([VV]) = n),
                                                                                                    (len([VV]) >= 0)} a))
 -> (GHC.Types.Bool))
-> [({VV : [(GHC.Types.Double)] | (n = len([VV])),
                                  (len([VV]) = n),
                                  (len([VV]) >= 0)} , (KMeans.WrapType {VV : [(GHC.Types.Double)] | (n = len([VV])),
                                                                                                    (len([VV]) = n),
                                                                                                    (len([VV]) >= 0)} a))]
-> [{VV : [({VV : [(GHC.Types.Double)] | (n = len([VV])),
                                         (len([VV]) = n),
                                         (len([VV]) >= 0)} , (KMeans.WrapType {VV : [(GHC.Types.Double)] | (n = len([VV])),
                                                                                                           (len([VV]) = n),
                                                                                                           (len([VV]) >= 0)} a))] | (len([VV]) > 0)}]groupBy ((GHC.Classes.Eq [GHC.Types.Double])(==) ({VV : [(GHC.Types.Double)] | (n = len([VV])),
                              (len([VV]) = n),
                              (len([VV]) >= 0)}
 -> {VV : [(GHC.Types.Double)] | (n = len([VV])),
                                 (len([VV]) = n),
                                 (len([VV]) >= 0)}
 -> (GHC.Types.Bool))
-> (({VV : [(GHC.Types.Double)] | (n = len([VV])),
                                  (len([VV]) = n),
                                  (len([VV]) >= 0)} , (KMeans.WrapType {VV : [(GHC.Types.Double)] | (n = len([VV])),
                                                                                                    (len([VV]) = n),
                                                                                                    (len([VV]) >= 0)} a))
    -> {VV : [(GHC.Types.Double)] | (n = len([VV])),
                                    (len([VV]) = n),
                                    (len([VV]) >= 0)})
-> ({VV : [(GHC.Types.Double)] | (n = len([VV])),
                                 (len([VV]) = n),
                                 (len([VV]) >= 0)} , (KMeans.WrapType {VV : [(GHC.Types.Double)] | (n = len([VV])),
                                                                                                   (len([VV]) = n),
                                                                                                   (len([VV]) >= 0)} a))
-> ({VV : [(GHC.Types.Double)] | (n = len([VV])),
                                 (len([VV]) = n),
                                 (len([VV]) >= 0)} , (KMeans.WrapType {VV : [(GHC.Types.Double)] | (n = len([VV])),
                                                                                                   (len([VV]) = n),
                                                                                                   (len([VV]) >= 0)} a))
-> (GHC.Types.Bool)`on` ({VV : [(GHC.Types.Double)] | (n = len([VV])),
                              (len([VV]) = n),
                              (len([VV]) >= 0)} , (KMeans.WrapType {VV : [(GHC.Types.Double)] | (n = len([VV])),
                                                                                                (len([VV]) = n),
                                                                                                (len([VV]) >= 0)} a))
-> {VV : [(GHC.Types.Double)] | (n = len([VV])),
                                (len([VV]) = n),
                                (len([VV]) >= 0)}fst) {VV : [({VV : [(GHC.Types.Double)] | (n = len([VV])),
                                     (len([VV]) = n),
                                     (len([VV]) >= 0)} , (KMeans.WrapType {VV : [(GHC.Types.Double)] | (n = len([VV])),
                                                                                                       (len([VV]) = n),
                                                                                                       (len([VV]) >= 0)} a))] | (VV = centeredPoints),
                                                                                                                                (len([VV]) >= 0)}centeredPoints
197:     {VV : [{VV : [(KMeans.WrapType {VV : [(GHC.Types.Double)] | (n = len([VV])),
                                                            (len([VV]) = n),
                                                            (len([VV]) >= 0)} a)] | (len([VV]) > 0)}] | (len([VV]) = len([centeredGroups]))}clusters'      = ({VV : [({VV : [(GHC.Types.Double)] | (n = len([VV])),
                                      (len([VV]) = n),
                                      (len([VV]) >= 0)} , (KMeans.WrapType {VV : [(GHC.Types.Double)] | (n = len([VV])),
                                                                                                        (len([VV]) = n),
                                                                                                        (len([VV]) >= 0)} a))] | (len([VV]) > 0)}
 -> {VV : [(KMeans.WrapType {VV : [(GHC.Types.Double)] | (n = len([VV])),
                                                         (len([VV]) = n),
                                                         (len([VV]) >= 0)} a)] | (len([VV]) > 0)})
-> xs:[{VV : [({VV : [(GHC.Types.Double)] | (n = len([VV])),
                                            (len([VV]) = n),
                                            (len([VV]) >= 0)} , (KMeans.WrapType {VV : [(GHC.Types.Double)] | (n = len([VV])),
                                                                                                              (len([VV]) = n),
                                                                                                              (len([VV]) >= 0)} a))] | (len([VV]) > 0)}]
-> {VV : [{VV : [(KMeans.WrapType {VV : [(GHC.Types.Double)] | (n = len([VV])),
                                                               (len([VV]) = n),
                                                               (len([VV]) >= 0)} a)] | (len([VV]) > 0)}] | (len([VV]) = len([xs]))}map ((({VV : [(GHC.Types.Double)] | (n = len([VV])),
                               (len([VV]) = n),
                               (len([VV]) >= 0)} , (KMeans.WrapType {VV : [(GHC.Types.Double)] | (n = len([VV])),
                                                                                                 (len([VV]) = n),
                                                                                                 (len([VV]) >= 0)} a))
 -> (KMeans.WrapType {VV : [(GHC.Types.Double)] | (n = len([VV])),
                                                  (len([VV]) = n),
                                                  (len([VV]) >= 0)} a))
-> xs:[({VV : [(GHC.Types.Double)] | (n = len([VV])),
                                     (len([VV]) = n),
                                     (len([VV]) >= 0)} , (KMeans.WrapType {VV : [(GHC.Types.Double)] | (n = len([VV])),
                                                                                                       (len([VV]) = n),
                                                                                                       (len([VV]) >= 0)} a))]
-> {VV : [(KMeans.WrapType {VV : [(GHC.Types.Double)] | (n = len([VV])),
                                                        (len([VV]) = n),
                                                        (len([VV]) >= 0)} a)] | (len([VV]) = len([xs]))}map ({VV : [(GHC.Types.Double)] | (n = len([VV])),
                              (len([VV]) = n),
                              (len([VV]) >= 0)} , (KMeans.WrapType {VV : [(GHC.Types.Double)] | (n = len([VV])),
                                                                                                (len([VV]) = n),
                                                                                                (len([VV]) >= 0)} a))
-> (KMeans.WrapType {VV : [(GHC.Types.Double)] | (n = len([VV])),
                                                 (len([VV]) = n),
                                                 (len([VV]) >= 0)} a)snd) {VV : [{VV : [({VV : [(GHC.Types.Double)] | (n = len([VV])),
                                            (len([VV]) = n),
                                            (len([VV]) >= 0)} , (KMeans.WrapType {VV : [(GHC.Types.Double)] | (n = len([VV])),
                                                                                                              (len([VV]) = n),
                                                                                                              (len([VV]) >= 0)} a))] | (len([VV]) > 0)}] | (VV = centeredGroups),
                                                                                                                                                           (len([VV]) >= 0)}centeredGroups
The behavior of `refineCluster` is pithily captured by its type
203: {-@ refineCluster :: n:Int
204:                   -> Clustering (GenPoint a n)
205:                   -> Clustering (GenPoint a n) @-}
The refined clustering is computed in three steps. 1. First, we compute the `centers :: [(Point n)]` of the current `clusters`. This is achieved by using `clusterCenter`, which maps a list of generalized `n`-dimensional points to a *single* `n` dimensional point (i.e. `Point n`). 2. Next, we pair each point `p` in the list of all `points` with its `nearestCenter`. 3. Finally, the pairs in the list of `centeredPoints` are grouped by the center, i.e. the first element of the tuple. The resulting groups are projected back to the original generalized points yielding the new clustering. The type of the output follows directly from
222: groupBy :: (a -> a -> Bool) -> [a] -> (Clustering a)
from [last time][kmeansI]. At the call site above, LiquidHaskell infers that `a` can be instantiated with `((Point n), (GenPoint a n))` thereby establishing that, after *projecting away* the first element, the output is a list of non-empty lists of generalized `n`-dimensional points. That leaves us with the two crucial bits of the algorithm: `clusterCenter` and `nearestCenter`. Computing the Center of a Cluster --------------------------------- The center of an `n`-dimensional cluster is simply an `n`-dimensional point whose value in each dimension is equal to the *average* value of that dimension across all the points in the cluster. For example, consider a cluster of 2-dimensional points,
241: exampleCluster = [ [0,  0]
242:                  , [1, 10]
243:                  , [2, 20]
244:                  , [4, 40]
245:                  , [5, 50] ]
The center of the cluster is
249: exampleCenter = [ (0 + 1  + 2  + 4  + 5 ) / 5
250:                 , (0 + 10 + 20 + 40 + 50) / 5 ]
which is just
254: exampleCenter = [ 3, 30 ]
Thus, we can compute a `clusterCenter` via the following procedure
260: n:(GHC.Types.Int)
-> {v : [(KMeans.WrapType {VV : [(GHC.Types.Double)] | (len([VV]) = n)} a)] | (len([v]) > 0)}
-> {VV : [(GHC.Types.Double)] | (len([VV]) = n)}clusterCenter (GHC.Types.Int)n {v : [(KMeans.WrapType {VV : [(GHC.Types.Double)] | (len([VV]) = n)} a)] | (len([v]) > 0)}xs = ({VV : [(GHC.Types.Double)] | (numPoints = len([VV])),
                              (len([VV]) = numPoints),
                              (len([VV]) = len([xs])),
                              (len([VV]) > 0)}
 -> (GHC.Types.Double))
-> xs:[{VV : [(GHC.Types.Double)] | (numPoints = len([VV])),
                                    (len([VV]) = numPoints),
                                    (len([VV]) = len([xs])),
                                    (len([VV]) > 0)}]
-> {VV : [(GHC.Types.Double)] | (len([VV]) = len([xs]))}map {VV : [(GHC.Types.Double)] | (numPoints = len([VV])),
                             (len([VV]) = numPoints),
                             (len([VV]) = len([xs])),
                             (len([VV]) > 0)}
-> (GHC.Types.Double)average {v : [{VV : [(GHC.Types.Double)] | (len([VV]) = numPoints)}] | (v = xs'),
                                                               (len([v]) = n),
                                                               (len([v]) >= 0)}xs'
261:   where
262:     {VV : (GHC.Types.Int) | (VV = len([xs]))}numPoints      = xs:[(KMeans.WrapType {VV : [(GHC.Types.Double)] | (n = len([VV])),
                                                  (len([VV]) = n)} a)]
-> {VV : (GHC.Types.Int) | (VV = len([xs]))}length {v : [(KMeans.WrapType {VV : [(GHC.Types.Double)] | (len([VV]) = n)} a)] | (v = xs),
                                                                           (len([v]) > 0),
                                                                           (len([v]) >= 0)}xs
263:     {v : [{VV : [(GHC.Types.Double)] | (len([VV]) = numPoints)}] | (len([v]) = n)}xs'            = c:(GHC.Types.Int)
-> r:{VV : (GHC.Types.Int) | (VV > 0)}
-> {v : [{VV : [(GHC.Types.Double)] | (len([VV]) = c)}] | (len([v]) = r)}
-> {v : [{VV : [(GHC.Types.Double)] | (len([VV]) = r)}] | (len([v]) = c)}transpose {VV : (GHC.Types.Int) | (VV = n)}n {VV : (GHC.Types.Int) | (VV = numPoints),(VV = len([xs]))}numPoints (((KMeans.WrapType {VV : [(GHC.Types.Double)] | (n = len([VV])),
                                               (len([VV]) = n)} a)
 -> {VV : [(GHC.Types.Double)] | (n = len([VV])),(len([VV]) = n)})
-> xs:[(KMeans.WrapType {VV : [(GHC.Types.Double)] | (n = len([VV])),
                                                     (len([VV]) = n)} a)]
-> {VV : [{VV : [(GHC.Types.Double)] | (n = len([VV])),
                                       (len([VV]) = n)}] | (len([VV]) = len([xs]))}map (KMeans.WrapType {VV : [(GHC.Types.Double)] | (n = len([VV])),
                                              (len([VV]) = n)} a)
-> {VV : [(GHC.Types.Double)] | (n = len([VV])),(len([VV]) = n)}getVect {v : [(KMeans.WrapType {VV : [(GHC.Types.Double)] | (len([VV]) = n)} a)] | (v = xs),
                                                                           (len([v]) > 0),
                                                                           (len([v]) >= 0)}xs)
264: 
265:     average        :: [Double] -> Double
266:     {VV : [(GHC.Types.Double)] | (numPoints = len([VV])),
                             (len([VV]) = numPoints),
                             (len([VV]) = len([xs])),
                             (len([VV]) > 0)}
-> (GHC.Types.Double)average        = ((GHC.Types.Double)
-> {VV : (GHC.Types.Int) | (VV > 0)} -> (GHC.Types.Double)`safeDiv` {VV : (GHC.Types.Int) | (VV = numPoints),(VV = len([xs]))}numPoints) ((GHC.Types.Double) -> (GHC.Types.Double))
-> ({VV : [(GHC.Types.Double)] | (numPoints = len([VV])),
                                 (len([VV]) = numPoints),
                                 (len([VV]) = len([xs])),
                                 (len([VV]) > 0)}
    -> (GHC.Types.Double))
-> {VV : [(GHC.Types.Double)] | (numPoints = len([VV])),
                                (len([VV]) = numPoints),
                                (len([VV]) = len([xs])),
                                (len([VV]) > 0)}
-> (GHC.Types.Double). [(GHC.Types.Double)] -> (GHC.Types.Double)sum
First, we `transpose` the matrix of points in the cluster. Suppose that `xs` is the `exampleCluster` from above (and so `n` is `2` and `numPoints` is `5`.) In this scenario, `xs'` is
274: xs' = [ [0,  1,  2,  4,  5]
275:       , [0, 10, 20, 40, 50] ]
and so `map average xs'` evaluates to `exampleCenter` from above. We have ensured that the division in the average does not lead to any nasty surprises via a *safe division* function whose precondition checks that the denominator is non-zero, [as illustrated here][ref101].
285: {- safeDiv   :: (Fractional a) => a -> {v:Int | v != 0} -> a -}
286: safeDiv     :: (Fractional a) => a -> Int -> a
287: (GHC.Real.Fractional a)
-> a -> {VV : (GHC.Types.Int) | (VV > 0)} -> asafeDiv an 0 = {VV : [(GHC.Types.Char)] | false} -> {VV : a | false}liquidError {VV : [(GHC.Types.Char)] | (len([VV]) >= 0)}"divide by zero"
288: safeDiv n d = {VV : a | (VV = n)}n x:a -> y:{VV : a | (VV != 0)} -> {VV : a | (VV = (x / y))}/ ((GHC.Num.Num a)fromIntegral {VV : (GHC.Types.Int) | (VV > 0)}d)
LiquidHaskell verifies that the divide-by-zero never occurs, and furthermore, that `clusterCenter` indeed computes an `n`-dimensional center by inferring that
295: {-@ clusterCenter :: n:Int -> NonEmptyList (GenPoint a n) -> Point n @-}
LiquidHaskell deduces that the *input* list of points `xs` is non-empty from the fact that `clusterCenter` is only invoked on the elements of a `Clustering` which comprise only non-empty lists. Since `xs` is non-empty, i.e. `(len xs) > 0`, LiquidHaskell infers that `numPoints` is positive (hover over `length` to understand why), and hence, LiquidHaskell is satisfied that the call to `safeDiv` will always proceed without any incident. To establish the *output* type `Point n` LiquidHaskell leans on the fact that
307: transpose :: n:Int -> m:PosInt -> Matrix a m n -> Matrix a n m
to deduce that `xs'` is of type `Matrix Double n numPoints`, that is to say, a list of length `n` containing lists of length `numPoints`. Since `map` preserves the length, the value `map average xs'` is also a list of length `n`, i.e. `Point n`. Finding the Nearest Center -------------------------- The last piece of the puzzle is `nearestCenter` which maps each (generalized) point to the center that it is nearest. The code is pretty self-explanatory:
324: nearestCenter     :: Int -> WrapType [Double] a -> [[Double]] -> [Double]
325: n:(GHC.Types.Int)
-> (KMeans.WrapType {VV : [(GHC.Types.Double)] | (len([VV]) = n)} a)
-> [{VV : [(GHC.Types.Double)] | (len([VV]) = n)}]
-> {VV : [(GHC.Types.Double)] | (len([VV]) = n)}nearestCenter (GHC.Types.Int)n (KMeans.WrapType {VV : [(GHC.Types.Double)] | (len([VV]) = n)} a)x = {VV : [({VV : [(GHC.Types.Double)] | (n = len([VV])),
                                     (len([VV]) = n),
                                     (len([VV]) >= 0)} , (GHC.Types.Double))] | (len([VV]) >= 0)}
-> {VV : [(GHC.Types.Double)] | (n = len([VV])),
                                (len([VV]) = n),
                                (len([VV]) >= 0)}minKey ({VV : [({VV : [(GHC.Types.Double)] | (n = len([VV])),
                                      (len([VV]) = n),
                                      (len([VV]) >= 0)} , (GHC.Types.Double))] | (len([VV]) >= 0)}
 -> {VV : [(GHC.Types.Double)] | (n = len([VV])),
                                 (len([VV]) = n),
                                 (len([VV]) >= 0)})
-> ([{VV : [(GHC.Types.Double)] | (n = len([VV])),(len([VV]) = n)}]
    -> {VV : [({VV : [(GHC.Types.Double)] | (n = len([VV])),
                                            (len([VV]) = n),
                                            (len([VV]) >= 0)} , (GHC.Types.Double))] | (len([VV]) >= 0)})
-> [{VV : [(GHC.Types.Double)] | (n = len([VV])),(len([VV]) = n)}]
-> {VV : [(GHC.Types.Double)] | (n = len([VV])),
                                (len([VV]) = n),
                                (len([VV]) >= 0)}. ({VV : [(GHC.Types.Double)] | (n = len([VV])),(len([VV]) = n)}
 -> ({VV : [(GHC.Types.Double)] | (n = len([VV])),
                                  (len([VV]) = n),
                                  (len([VV]) >= 0)} , (GHC.Types.Double)))
-> xs:[{VV : [(GHC.Types.Double)] | (n = len([VV])),
                                    (len([VV]) = n)}]
-> {VV : [({VV : [(GHC.Types.Double)] | (n = len([VV])),
                                        (len([VV]) = n),
                                        (len([VV]) >= 0)} , (GHC.Types.Double))] | (len([VV]) = len([xs]))}map (c:{VV : [(GHC.Types.Double)] | (n = len([VV])),(len([VV]) = n)}
-> ({VV : [(GHC.Types.Double)] | (VV = c),
                                 (n = len([VV])),
                                 (len([VV]) = n),
                                 (len([VV]) = len([c])),
                                 (len([VV]) >= 0)} , (GHC.Types.Double))\{VV : [(GHC.Types.Double)] | (n = len([VV])),(len([VV]) = n)}c -> x1:a -> b -> (a , b)({VV : [(GHC.Types.Double)] | (VV = c),
                             (n = len([VV])),
                             (len([VV]) = n),
                             (len([VV]) >= 0)}c, a:{VV : [(GHC.Types.Double)] | (len([VV]) >= 0)}
-> {VV : [(GHC.Types.Double)] | (len([VV]) = len([a])),
                                (len([VV]) >= 0)}
-> (GHC.Types.Double)distance {VV : [(GHC.Types.Double)] | (VV = c),
                             (n = len([VV])),
                             (len([VV]) = n),
                             (len([VV]) >= 0)}c ((KMeans.WrapType {VV : [(GHC.Types.Double)] | (n = len([VV])),
                                              (len([VV]) = n),
                                              (len([VV]) = len([c])),
                                              (len([VV]) >= 0)} a)
-> {VV : [(GHC.Types.Double)] | (n = len([VV])),
                                (len([VV]) = n),
                                (len([VV]) = len([c])),
                                (len([VV]) >= 0)}getVect {VV : (KMeans.WrapType {VV : [(GHC.Types.Double)] | (len([VV]) = n)} a) | (VV = x)}x)))
We `map` the centers to a tuple of center `c` and the `distance` between `x` and `c`, and then we select the tuple with the smallest distance
332: minKey  :: (Ord v) => [(k, v)] -> k
333: (GHC.Classes.Ord a) -> {VV : [(b , a)] | (len([VV]) >= 0)} -> bminKey  = (a , b) -> afst ((a , b) -> a)
-> ({VV : [(a , b)] | (len([VV]) >= 0)} -> (a , b))
-> {VV : [(a , b)] | (len([VV]) >= 0)}
-> a. ((a , b) -> (a , b) -> (GHC.Types.Ordering))
-> [(a , b)] -> (a , b)minimumBy ((a , b) -> (a , b) -> (GHC.Types.Ordering)\(a , b)x (a , b)y -> x:a
-> y:a
-> {VV : (GHC.Types.Ordering) | ((VV = EQ) <=> (x = y)),
                                ((VV = GT) <=> (x > y)),
                                ((VV = LT) <=> (x < y))}compare ((a , b) -> bsnd {VV : (a , b) | (VV = x)}x) ((a , b) -> bsnd {VV : (a , b) | (VV = y)}y))
The interesting bit is that the `distance` function uses `zipWith` to ensure that the dimensionality of the center and the point match up.
340: distance     :: [Double] -> [Double] -> Double
341: a:{VV : [(GHC.Types.Double)] | (len([VV]) >= 0)}
-> {VV : [(GHC.Types.Double)] | (len([VV]) = len([a])),
                                (len([VV]) >= 0)}
-> (GHC.Types.Double)distance {VV : [(GHC.Types.Double)] | (len([VV]) >= 0)}a {VV : [(GHC.Types.Double)] | (len([VV]) = len([a])),
                             (len([VV]) >= 0)}b = (GHC.Types.Double) -> (GHC.Types.Double)sqrt ((GHC.Types.Double) -> (GHC.Types.Double))
-> ({VV : [(GHC.Types.Double)] | (len([VV]) = len([a])),
                                 (len([VV]) = len([b])),
                                 (len([VV]) >= 0)}
    -> (GHC.Types.Double))
-> {VV : [(GHC.Types.Double)] | (len([VV]) = len([a])),
                                (len([VV]) = len([b])),
                                (len([VV]) >= 0)}
-> (GHC.Types.Double). [(GHC.Types.Double)] -> (GHC.Types.Double)sum ({VV : [(GHC.Types.Double)] | (len([VV]) = len([a])),
                              (len([VV]) = len([b])),
                              (len([VV]) >= 0)}
 -> (GHC.Types.Double))
-> {VV : [(GHC.Types.Double)] | (len([VV]) = len([a])),
                                (len([VV]) = len([b])),
                                (len([VV]) >= 0)}
-> (GHC.Types.Double)$ ((GHC.Types.Double) -> (GHC.Types.Double) -> (GHC.Types.Double))
-> xs:[(GHC.Types.Double)]
-> {VV : [(GHC.Types.Double)] | (len([VV]) = len([xs]))}
-> {VV : [(GHC.Types.Double)] | (len([VV]) = len([xs]))}zipWith ((GHC.Types.Double) -> (GHC.Types.Double) -> (GHC.Types.Double)\(GHC.Types.Double)v1 (GHC.Types.Double)v2 -> {VV : (GHC.Integer.Type.Integer) | (VV = 2)}({VV : (GHC.Types.Double) | (VV = v1)}v1 x:(GHC.Types.Double)
-> y:(GHC.Types.Double)
-> {VV : (GHC.Types.Double) | (VV = (x - y))}- {VV : (GHC.Types.Double) | (VV = v2)}v2) (GHC.Types.Double)
-> (GHC.Integer.Type.Integer) -> (GHC.Types.Double)^ 2) {VV : [(GHC.Types.Double)] | (VV = a),(len([VV]) >= 0)}a {VV : [(GHC.Types.Double)] | (VV = b),
                             (len([VV]) = len([a])),
                             (len([VV]) >= 0)}b
LiquidHaskell verifies `distance` by inferring that
347: {-@ nearestCenter :: n:Int -> (GenPoint a n) -> [(Point n)] -> (Point n) @-}
First, LiquidHaskell deduces that each center in `cs` is indeed `n`-dimensional, which follows from the output type of `clusterCenter`. Since `x` is a `(GenPoint a n)` LiquidHaskell infers that both `c` and `getVect x` are of an equal length `n`. Consequently, the call to
355: zipWith :: (a -> b -> c) -> xs:[a] -> (List b (len xs)) -> (List c (len xs))
[discussed last time][kmeansI] is determined to be safe. Finally, the value returned is just one of the input centers and so is a `(Point n)`. Putting It All Together: Top-Level API -------------------------------------- We can bundle the algorithm into two top-level API functions. First, a version that clusters *generalized* points. In this case, we require a function that can `project` an `a` value to an `n`-dimensional point. This function just wraps each `a`, clusters via `kmeans'` and then unwraps the points.
374: {-@ kmeansGen :: n:Int
375:               -> (a -> (Point n))
376:               -> k:PosInt
377:               -> xs:[a]
378:               -> (Clustering a)
379:   @-}
380: 
381: n:(GHC.Types.Int)
-> (a -> {VV : [(GHC.Types.Double)] | (len([VV]) = n)})
-> {VV : (GHC.Types.Int) | (VV > 0)}
-> [a]
-> [{VV : [a] | (len([VV]) > 0)}]kmeansGen (GHC.Types.Int)n a -> {VV : [(GHC.Types.Double)] | (len([VV]) = n)}project {VV : (GHC.Types.Int) | (VV > 0)}k = ({VV : [(KMeans.WrapType {VV : [(GHC.Types.Double)] | (n = len([VV])),
                                                      (len([VV]) = n)} a)] | (len([VV]) > 0)}
 -> {VV : [a] | (len([VV]) > 0)})
-> xs:[{VV : [(KMeans.WrapType {VV : [(GHC.Types.Double)] | (n = len([VV])),
                                                            (len([VV]) = n)} a)] | (len([VV]) > 0)}]
-> {VV : [{VV : [a] | (len([VV]) > 0)}] | (len([VV]) = len([xs]))}map (((KMeans.WrapType {VV : [(GHC.Types.Double)] | (n = len([VV])),
                                               (len([VV]) = n)} a)
 -> a)
-> xs:[(KMeans.WrapType {VV : [(GHC.Types.Double)] | (n = len([VV])),
                                                     (len([VV]) = n)} a)]
-> {VV : [a] | (len([VV]) = len([xs]))}map (KMeans.WrapType {VV : [(GHC.Types.Double)] | (n = len([VV])),
                                              (len([VV]) = n)} a)
-> agetVal)
382:                       ([{VV : [(KMeans.WrapType {VV : [(GHC.Types.Double)] | (n = len([VV])),
                                                       (len([VV]) = n)} a)] | (len([VV]) > 0)}]
 -> {VV : [{VV : [a] | (len([VV]) > 0)}] | (len([VV]) >= 0)})
-> ([a]
    -> [{VV : [(KMeans.WrapType {VV : [(GHC.Types.Double)] | (n = len([VV])),
                                                             (len([VV]) = n)} a)] | (len([VV]) > 0)}])
-> [a]
-> {VV : [{VV : [a] | (len([VV]) > 0)}] | (len([VV]) >= 0)}. n:(GHC.Types.Int)
-> {VV : (GHC.Types.Int) | (VV > 0)}
-> [(KMeans.WrapType {VV : [(GHC.Types.Double)] | (len([VV]) = n)} a)]
-> [{v : [(KMeans.WrapType {VV : [(GHC.Types.Double)] | (len([VV]) = n)} a)] | (len([v]) > 0)}]kmeans' {VV : (GHC.Types.Int) | (VV = n)}n {VV : (GHC.Types.Int) | (VV = k),(VV > 0)}k
383:                       ({VV : [(KMeans.WrapType {VV : [(GHC.Types.Double)] | (n = len([VV])),
                                                      (len([VV]) = n),
                                                      (len([VV]) >= 0)} a)] | (len([VV]) >= 0)}
 -> [{VV : [(KMeans.WrapType {VV : [(GHC.Types.Double)] | (n = len([VV])),
                                                          (len([VV]) = n)} a)] | (len([VV]) > 0)}])
-> ([a]
    -> {VV : [(KMeans.WrapType {VV : [(GHC.Types.Double)] | (n = len([VV])),
                                                            (len([VV]) = n),
                                                            (len([VV]) >= 0)} a)] | (len([VV]) >= 0)})
-> [a]
-> [{VV : [(KMeans.WrapType {VV : [(GHC.Types.Double)] | (n = len([VV])),
                                                         (len([VV]) = n)} a)] | (len([VV]) > 0)}]. (a
 -> (KMeans.WrapType {VV : [(GHC.Types.Double)] | (n = len([VV])),
                                                  (len([VV]) = n),
                                                  (len([VV]) >= 0)} a))
-> xs:[a]
-> {VV : [(KMeans.WrapType {VV : [(GHC.Types.Double)] | (n = len([VV])),
                                                        (len([VV]) = n),
                                                        (len([VV]) >= 0)} a)] | (len([VV]) = len([xs]))}map (x:a
-> (KMeans.WrapType {VV : [(GHC.Types.Double)] | (n = len([VV])),
                                                 (len([VV]) = n),
                                                 (len([VV]) >= 0)} {VV : a | (VV = x)})\ax -> {VV : [(GHC.Types.Double)] | (n = len([VV])),
                             (len([VV]) = n),
                             (len([VV]) >= 0)}
-> {VV : a | (VV = x)}
-> (KMeans.WrapType {VV : [(GHC.Types.Double)] | (n = len([VV])),
                                                 (len([VV]) = n),
                                                 (len([VV]) >= 0)} {VV : a | (VV = x)})WrapType (a -> {VV : [(GHC.Types.Double)] | (len([VV]) = n)}project {VV : a | (VV = x)}x) {VV : a | (VV = x)}x)
Second, a specialized version that operates directly on `n`-dimensional points. The specialized version just calls the general version with a trivial `id` projection.
391: {-@ kmeans :: n:Int
392:            -> k:PosInt
393:            -> points:[(Point n)]
394:            -> (Clustering (Point n))
395:   @-}
396: 
397: n:(GHC.Types.Int)
-> {VV : (GHC.Types.Int) | (VV > 0)}
-> [{VV : [(GHC.Types.Double)] | (len([VV]) = n)}]
-> [{v : [{VV : [(GHC.Types.Double)] | (len([VV]) = n)}] | (len([v]) > 0)}]kmeans (GHC.Types.Int)n   = n:(GHC.Types.Int)
-> ({VV : [(GHC.Types.Double)] | (n = len([VV])),(len([VV]) = n)}
    -> {VV : [(GHC.Types.Double)] | (len([VV]) = n)})
-> {VV : (GHC.Types.Int) | (VV > 0)}
-> [{VV : [(GHC.Types.Double)] | (n = len([VV])),(len([VV]) = n)}]
-> [{VV : [{VV : [(GHC.Types.Double)] | (n = len([VV])),
                                        (len([VV]) = n)}] | (len([VV]) > 0)}]kmeansGen {VV : (GHC.Types.Int) | (VV = n)}n x:{VV : [(GHC.Types.Double)] | (n = len([VV])),(len([VV]) = n)}
-> {VV : [(GHC.Types.Double)] | (VV = x),
                                (n = len([VV])),
                                (len([VV]) = n)}id
Conclusions ----------- I hope that over the last two posts you have gotten a sense of 1. What KMeans clustering is all about, 2. How measures and refinements can be used to describe the behavior of common list operations like `map`, `transpose`, `groupBy`, `zipWith`, and so on, 3. How LiquidHaskell's automated inference makes it easy to write and verify invariants of non-trivial programs. The sharp reader will have noticed that the one *major*, non syntactic, change to the [original code][URL-kmeans] is the addition of the dimension parameter `n` throughout the code. This is critically required so that we can specify the relevant invariants (which are in terms of `n`.) The value is actually a ghost, and never ever used. Fortunately, Haskell's laziness means that we don't have to worry about it (or other ghost variables) imposing any run-time overhead at all. **Exercise:** Incidentally, if you have followed thus far I would encourage you to ponder about how you might modify the types (and implementation) to verify that KMeans indeed produces at most `k` clusters... [ref101]: /blog/2013/01/01/refinement-types-101.lhs/ [safeList]: /blog/2013/01/31/safely-catching-a-list-by-its-tail.lhs/ [kmeansI]: /blog/2013/02/16/kmeans-clustering-I.lhs/ [kmeansII]: /blog/2013/02/17/kmeans-clustering-II.lhs/ [URL-take]: https://github.com/ucsd-progsys/liquidhaskell/blob/master/include/GHC/List.lhs#L334 [URL-groupBy]: http://hackage.haskell.org/packages/archive/base/latest/doc/html/Data-List.html#v:groupBy [URL-transpose]: http://hackage.haskell.org/packages/archive/base/latest/doc/html/src/Data-List.html#transpose [maru]: http://www.youtube.com/watch?v=8uDuls5TyNE [demo]: http://goto.ucsd.edu/~rjhala/liquid/haskell/demo/#?demo=KMeansHelper.hs [URL-kmeans]: http://hackage.haskell.org/package/kmeans