hierarchical-clustering-0.4.7: Fast algorithms for single, average/UPGMA and complete linkage clustering.

Safe HaskellNone
LanguageHaskell98

Data.Clustering.Hierarchical

Contents

Synopsis

Dendrogram data type

data Dendrogram a Source #

Data structure for storing hierarchical clusters. The distance between clusters is stored on the branches. Distances between leafs are the distances between the elements on those leafs, while distances between branches are defined by the linkage used (see Linkage).

Constructors

Leaf a

The leaf contains the item a itself.

Branch !Distance (Dendrogram a) (Dendrogram a)

Each branch connects two clusters/dendrograms that are d distance apart.

Instances
Functor Dendrogram Source #

Does not recalculate the distances!

Instance details

Defined in Data.Clustering.Hierarchical.Internal.Types

Methods

fmap :: (a -> b) -> Dendrogram a -> Dendrogram b #

(<$) :: a -> Dendrogram b -> Dendrogram a #

Foldable Dendrogram Source # 
Instance details

Defined in Data.Clustering.Hierarchical.Internal.Types

Methods

fold :: Monoid m => Dendrogram m -> m #

foldMap :: Monoid m => (a -> m) -> Dendrogram a -> m #

foldr :: (a -> b -> b) -> b -> Dendrogram a -> b #

foldr' :: (a -> b -> b) -> b -> Dendrogram a -> b #

foldl :: (b -> a -> b) -> b -> Dendrogram a -> b #

foldl' :: (b -> a -> b) -> b -> Dendrogram a -> b #

foldr1 :: (a -> a -> a) -> Dendrogram a -> a #

foldl1 :: (a -> a -> a) -> Dendrogram a -> a #

toList :: Dendrogram a -> [a] #

null :: Dendrogram a -> Bool #

length :: Dendrogram a -> Int #

elem :: Eq a => a -> Dendrogram a -> Bool #

maximum :: Ord a => Dendrogram a -> a #

minimum :: Ord a => Dendrogram a -> a #

sum :: Num a => Dendrogram a -> a #

product :: Num a => Dendrogram a -> a #

Traversable Dendrogram Source # 
Instance details

Defined in Data.Clustering.Hierarchical.Internal.Types

Methods

traverse :: Applicative f => (a -> f b) -> Dendrogram a -> f (Dendrogram b) #

sequenceA :: Applicative f => Dendrogram (f a) -> f (Dendrogram a) #

mapM :: Monad m => (a -> m b) -> Dendrogram a -> m (Dendrogram b) #

sequence :: Monad m => Dendrogram (m a) -> m (Dendrogram a) #

Eq a => Eq (Dendrogram a) Source # 
Instance details

Defined in Data.Clustering.Hierarchical.Internal.Types

Methods

(==) :: Dendrogram a -> Dendrogram a -> Bool #

(/=) :: Dendrogram a -> Dendrogram a -> Bool #

Ord a => Ord (Dendrogram a) Source # 
Instance details

Defined in Data.Clustering.Hierarchical.Internal.Types

Show a => Show (Dendrogram a) Source # 
Instance details

Defined in Data.Clustering.Hierarchical.Internal.Types

type Distance = Double Source #

A distance is simply a synonym of Double for efficiency.

elements :: Dendrogram a -> [a] Source #

List of elements in a dendrogram.

cutAt :: Dendrogram a -> Distance -> [Dendrogram a] Source #

dendro `cutAt` threshold cuts the dendrogram dendro at all branches which have distances strictly greater than threshold.

For example, suppose we have

dendro = Branch 0.8
           (Branch 0.5
             (Branch 0.2
               (Leaf 'A')
               (Leaf 'B'))
             (Leaf 'C'))
           (Leaf 'D')

Then:

dendro `cutAt` 0.9 == dendro `cutAt` 0.8 == [dendro] -- no changes
dendro `cutAt` 0.7 == dendro `cutAt` 0.5 == [Branch 0.5 (Branch 0.2 (Leaf 'A') (Leaf 'B')) (Leaf 'C'), Leaf 'D']
dendro `cutAt` 0.4 == dendro `cutAt` 0.2 == [Branch 0.2 (Leaf 'A') (Leaf 'B'), Leaf 'C', Leaf 'D']
dendro `cutAt` 0.1 == [Leaf 'A', Leaf 'B', Leaf 'C', Leaf 'D'] -- no branches at all

Linkage data type

data Linkage Source #

The linkage type determines how the distance between clusters will be calculated. These are the linkage types currently available on this library.

Constructors

SingleLinkage

The distance between two clusters a and b is the minimum distance between an element of a and an element of b.

CompleteLinkage

The distance between two clusters a and b is the maximum distance between an element of a and an element of b.

CLINK

The same as CompleteLinkage, but using the CLINK algorithm. It's much faster however doesn't always give the best complete linkage dendrogram.

UPGMA

Unweighted Pair Group Method with Arithmetic mean, also called "average linkage". The distance between two clusters a and b is the arithmetic average between the distances of all elements in a to all elements in b.

FakeAverageLinkage

This method is usually wrongly called "average linkage". The distance between cluster a = a1 U a2 (that is, cluster a was formed by the linkage of clusters a1 and a2) and an old cluster b is (d(a1,b) + d(a2,b)) / 2. So when clustering two elements to create a cluster, this method is the same as UPGMA. However, in general when joining two clusters this method assigns equal weights to a1 and a2, while UPGMA assigns weights proportional to the number of elements in each cluster. See, for example:

Clustering function

dendrogram Source #

Arguments

:: Linkage

Linkage type to be used.

-> [a]

Items to be clustered.

-> (a -> a -> Distance)

Distance function between items.

-> Dendrogram a

Complete dendrogram.

Calculates a complete, rooted dendrogram for a list of items and a linkage type. The following are the time and space complexities for each linkage:

SingleLinkage
O(n^2) time and O(n) space, using the SLINK algorithm. This algorithm is optimal in both space and time and gives the same answer as the naive algorithm using a distance matrix.
CompleteLinkage
O(n^3) time and O(n^2) space, using the naive algorithm with a distance matrix. Use CLINK if you need more performance.
Complete linkage with CLINK
O(n^2) time and O(n) space, using the CLINK algorithm. Note that this algorithm doesn't always give the same answer as the naive algorithm using a distance matrix, but it's much faster.
UPGMA
O(n^3) time and O(n^2) space, using the naive algorithm with a distance matrix.
FakeAverageLinkage
O(n^3) time and O(n^2) space, using the naive algorithm with a distance matrix.