Safe Haskell | None |
---|---|
Language | Haskell2010 |
A new data structure for accurate on-line accumulation of rank-based statistics such as quantiles and trimmed means. . See original paper: "Computing extremely accurate quantiles using t-digest" by Ted Dunning and Otmar Ertl for more details https://github.com/tdunning/t-digest/blob/master/docs/t-digest-paper/histo.pdf.
Examples
>>>
quantile 0.99 (tdigest [1..1000] :: TDigest 25)
Just 990.5
>>>
quantile 0.99 (tdigest [1..1000] :: TDigest 3)
Just 989.0...
t-Digest is more precise in tails, especially median is imprecise:
>>>
median (forceCompress $ tdigest [1..1000] :: TDigest 25)
Just 497.6...
Semigroup
This operation isn't strictly associative, but statistical variables shouldn't be affected.
>>>
let td xs = tdigest xs :: TDigest 10
>>>
median (td [1..500] <> (td [501..1000] <> td [1001..1500]))
Just 802...
>>>
median ((td [1..500] <> td [501..1000]) <> td [1001..1500])
Just 726...
The linear is worst-case scenario:
>>>
let td' xs = tdigest (fairshuffle xs) :: TDigest 10
>>>
median (td' [1..500] <> (td' [501..1000] <> td' [1001..1500]))
Just 750.3789...
>>>
median ((td' [1..500] <> td' [501..1000]) <> td' [1001..1500])
Just 750.3789...
Synopsis
- data TDigest (compression :: Nat)
- tdigest :: (Foldable f, KnownNat comp) => f Double -> TDigest comp
- singleton :: KnownNat comp => Double -> TDigest comp
- insert :: KnownNat comp => Double -> TDigest comp -> TDigest comp
- insert' :: KnownNat comp => Double -> TDigest comp -> TDigest comp
- compress :: forall comp. KnownNat comp => TDigest comp -> TDigest comp
- forceCompress :: forall comp. KnownNat comp => TDigest comp -> TDigest comp
- minimumValue :: TDigest comp -> Mean
- maximumValue :: TDigest comp -> Mean
- median :: TDigest comp -> Maybe Double
- quantile :: Double -> TDigest comp -> Maybe Double
- mean :: TDigest comp -> Maybe Double
- variance :: TDigest comp -> Maybe Double
- stddev :: TDigest comp -> Maybe Double
- cdf :: Double -> TDigest comp -> Double
- icdf :: Double -> TDigest comp -> Maybe Double
- size :: TDigest comp -> Int
- valid :: TDigest comp -> Bool
- validate :: TDigest comp -> Either String (TDigest comp)
- debugPrint :: TDigest comp -> IO ()
Construction
data TDigest (compression :: Nat) Source #
TDigest
is a tree of centroids.
compression
is a 1/δ
. The greater the value of compression
the less
likely value merging will happen.
Instances
KnownNat comp => Reducer Double (TDigest comp) Source # | |
Show (TDigest compression) Source # | |
KnownNat comp => Semigroup (TDigest comp) Source # | |
KnownNat comp => Monoid (TDigest comp) Source # | |
KnownNat comp => Binary (TDigest comp) Source # |
|
NFData (TDigest comp) Source # |
|
Defined in Data.TDigest.Tree.Internal | |
HasHistogram (TDigest comp) Maybe Source # | |
Population
Insert single value into TDigest
.
Compression
>>>
let digest = foldl' (flip insert') mempty [0..1000] :: TDigest 10
>>>
(size digest, size $ compress digest)
(1001,52)
>>>
(quantile 0.1 digest, quantile 0.1 $ compress digest)
(Just 99.6...,Just 89.7...)
Note: when values are inserted in more random order, t-Digest self-compresses on the fly:
>>>
let digest = foldl' (flip insert') mempty (fairshuffle [0..1000]) :: TDigest 10
>>>
(size digest, size $ compress digest, size $ forceCompress digest)
(78,78,48)
>>>
quantile 0.1 digest
Just 98.9...
compress :: forall comp. KnownNat comp => TDigest comp -> TDigest comp Source #
Compress TDigest
.
Reinsert the centroids in "better" order (in original paper: in random) so they have opportunity to merge.
Compression will happen only if size is both:
bigger than
and bigger than relMaxSize
* compabsMaxSize
.
forceCompress :: forall comp. KnownNat comp => TDigest comp -> TDigest comp Source #
Perform compression, even if current size says it's not necessary.
Statistics
minimumValue :: TDigest comp -> Mean Source #
Center of left-most centroid. Note: may be different than min element inserted.
>>>
minimumValue (tdigest [1..100] :: TDigest 3)
1.0
maximumValue :: TDigest comp -> Mean Source #
Center of right-most centroid. Note: may be different than max element inserted.
>>>
maximumValue (tdigest [1..100] :: TDigest 3)
99.0
Percentile
Mean & Variance
- - >>> stddev (tdigest $ fairshuffle [0..100] :: TDigest 10) Just 29.1...
mean :: TDigest comp -> Maybe Double Source #
Mean.
>>>
mean (tdigest [1..100] :: TDigest 10)
Just 50.5
Note: if you only need the mean, calculate it directly.
CDF
cdf :: Double -> TDigest comp -> Double Source #
Cumulative distribution function.
Note: if this is the only thing you need, it's more efficient to count this directly.