javelin-0.1.2.0: Labeled one-dimensional arrays
Safe HaskellSafe-Inferred
LanguageGHC2021

Data.Series.Tutorial

Synopsis

    Introduction

    This is a short user guide on how to get started using javelin and its various modules.

    The central data structure at the heart of this package is the Series. A Series k a is a labeled array of type v filled with values of type a, indexed by keys of type k.

    Like Map, Series support efficient:

    • random access by key ( \(O(\log n)\) );
    • slice by key ( \(O(\log n)\) ).

    Like Vector, Series support efficient:

    • numerical operations.
    • random access by index ( \(O(1)\) );
    • slice by index ( \(O(1)\) );

    To follow along this tutorial, the following imports are expected:

    >>> import Data.Series as Series
    

    Construction

    The easiest way to create a Series is to do it from a list using fromList:

    >>> Series.fromList [ ('a', 1::Int), ('b', 2), ('c', 3), ('d', 4) ]
    index | values
    ----- | ------
      'a' |      1
      'b' |      2
      'c' |      3
      'd' |      4
    

    Note what happens when we have the same key ('a') attached to multiple values:

    >>> Series.fromList [ ('a', 1::Int), ('a', 0), ('b', 2), ('c', 3), ('d', 4) ]
    index | values
    ----- | ------
      'a' |      0
      'b' |      2
      'c' |      3
      'd' |      4
    

    Series, like Maps, have unique keys; therefore, the output series may not be the same length as the input series. See further below for an explanation of how to handle duplicate keys.

    Since Series are like Map, it's easy to convert between the two:

    >>> let mp = Data.Map.Strict.fromList [ ('a', 0::Int), ('a', 1), ('b', 2), ('c', 3), ('d', 4) ]
    >>> mp
    fromList [('a',1),('b',2),('c',3),('d',4)]
    >>> Series.fromStrictMap mp
    index | values
    ----- | ------
      'a' |      1
      'b' |      2
      'c' |      3
      'd' |      4
    

    Of course, fromLazyMap is also available. In fact, conversion to/from Series is supported for many types; see the IsSeries typeclass and its methods, toSeries and fromSeries.

    Index

    Series have two components: values and an index.

    The index (of type Index k) is an ordered set of unique elements which allows to determine where are each values in the series. Since all keys in an Index are unique and sorted, it is fast to find the value associated to any random key.

    As we'll see soon, Index is an important data structure which can be used to slice through a Series, so let's get comfortable with them.

    >>> import qualified Data.Series.Index as Index
    

    An Index can be constructed from a list:

    >>> Index.fromList [5::Int,5,4,3,2,1,5,5,5]
    Index [1,2,3,4,5]
    

    As you see above, repeated elements (in this case, 5) won't be repeated in the Index. Therefore, it often makes more sense to construct an Index using fromSet from a Set from Data.Set.

    One common way to construct an Index is to programmatically unfold a seed value using unfoldr. Below, we want to generate numbers from 7 down to 1:

    >>> Index.unfoldr (\x -> if x < 1 then Nothing else Just (x, x-1)) (7 :: Int)
    Index [1,2,3,4,5,6,7]
    

    This task is so common that there is a convenience function to create ranges, range. For example, if you want to create an Index of values starting at 1 and ending at 10, in steps of 3:

    >>> Index.range (+3) (1 :: Int) 10
    Index [1,4,7,10]
    

    An Index is very much like a Set, so you can

    and more.

    Selections

    Single-key selection

    Single-element selections are performed using at, which selects a single element by key. at is safe; if the key is missing, Nothing is returned:

    >>> let xs = Series.fromList [ ('a', 1::Int), ('b', 2), ('c', 3), ('d', 4) ]
    >>> xs
    index | values
    ----- | ------
      'a' |      1
      'b' |      2
      'c' |      3
      'd' |      4
    >>> xs `at` 'a'
    Just 1
    >>> xs `at` 'z'
    Nothing
    

    Bulk selections

    Bulk selection, also known as *slicing*, is the method by which we extract a sub-series from a series. In the examples below, we'll assume that we have the series aapl_close is available in-scope, which represents the closing price of Apple stock:

    >>> :{
    let aapl_close = Series.fromList [ ("2010-01-04", 6.5522 :: Double)
                                     , ("2010-01-05", 6.5636)
                                     , ("2010-01-06", 6.4592)
                                     , ("2010-01-07", 6.4472)
                                     , ("2010-01-08", 6.4901)
                                     -- No prices during the weekend
                                     , ("2010-01-11", 6.5152)
                                     , ("2010-01-12", 6.4047)
                                     , ("2010-01-13", 6.3642)
                                     , ("2010-01-14", 6.4328)
                                     , ("2010-01-15", 6.4579)
                                     ]
        :}
    

    Bulk selection is done via the select function. select works with many types of inputs. For example, we can query for a contiguous range of keys by using to:

    >>> aapl_close `select` "2010-01-04" `to` "2010-01-08"
           index | values
           ----- | ------
    "2010-01-04" | 6.5522
    "2010-01-05" | 6.5636
    "2010-01-06" | 6.4592
    "2010-01-07" | 6.4472
    "2010-01-08" | 6.4901
    

    You can also request unbounded ranges. For example all dates up to "2010-01-08" using upto:

    >>> aapl_close `select` upto "2010-01-08"
           index | values
           ----- | ------
    "2010-01-04" | 6.5522
    "2010-01-05" | 6.5636
    "2010-01-06" | 6.4592
    "2010-01-07" | 6.4472
    "2010-01-08" | 6.4901
    

    There's also the other unbound range, from:

    >>> aapl_close `select` from "2010-01-11"
           index | values
           ----- | ------
    "2010-01-11" | 6.5152
    "2010-01-12" | 6.4047
    "2010-01-13" | 6.3642
    "2010-01-14" | 6.4328
    "2010-01-15" | 6.4579
    

    Note that the bounds may contain less data than you think! For example, let's look at a 5-day range:

    >>> aapl_close `select` "2010-01-08" `to` "2010-01-12"
           index | values
           ----- | ------
    "2010-01-08" | 6.4901
    "2010-01-11" | 6.5152
    "2010-01-12" | 6.4047
    

    We've requested a range of 5 days ("2010-01-08", "2010-01-09", "2010-01-10", "2010-01-11", "2010-01-12"), but there's no data in our series with the keys "2010-01-09" and "2010-01-10", because it was the week-end (stock markets are usually closed on week-ends).

    Sometimes you want to be more specific than a contiguous range of data; select also supports bulk *random* access like so:

    >>> aapl_close `select` ["2010-01-08", "2010-01-10", "2010-01-12"]
           index | values
           ----- | ------
    "2010-01-08" | 6.4901
    "2010-01-12" | 6.4047
    

    Note above that we've requested data for the date "2010-01-10", but it's missing. Therefore, the data isn't returned. If you want to get a sub-series which has the exact index that you've asked for, you can use require in combination with an Index:

    >>> import qualified Data.Series.Index as Index
    >>> aapl_close `require` Index.fromList ["2010-01-08", "2010-01-10", "2010-01-12"]
           index |      values
           ----- |      ------
    "2010-01-08" | Just 6.4901
    "2010-01-10" |     Nothing
    "2010-01-12" | Just 6.4047
    

    Using require or select in conjunction with range is very powerful.

    Filtering and mapping

    Series support operations on both their index and their values. To illustrate this, let's load some latitude and longitude data for some cities.

    We'll assume that the following types are in scope:

    >>> import Data.Fixed (Centi)
    >>> data Position = Pos { latitude :: Centi, longitude :: Centi } deriving (Show)
    >>> :{
        let cities = Series.fromList [ ("Paris"::String , Pos  48.86    2.35)
                                     , ("New York City" , Pos  40.71   (-74.01))
                                     , ("Taipei"        , Pos  25.04    121.56)
                                     , ("Buenos Aires"  , Pos (-34.60) (-58.38)) 
                                     ]
        :}
    

    We can easily filter for data just like you would filter a list. In this example, let's find cities in the western hemisphere (i.e. cities which have negative longitudes), using filter:

    >>> Series.filter (\pos -> longitude pos < 0) cities
              index |                                      values
              ----- |                                      ------
     "Buenos Aires" | Pos {latitude = -34.60, longitude = -58.38}
    "New York City" |  Pos {latitude = 40.71, longitude = -74.01}
    

    We can transform the values of a Series using map. In this example, let's isolate the latitude of cities in the western hemisphere:

    >>> let western_cities = Series.filter (\pos -> longitude pos < 0) cities
    >>> Series.map latitude western_cities
              index | values
              ----- | ------
     "Buenos Aires" | -34.60
    "New York City" |  40.71
    

    Finally, we can summarize the Series by reducing all its values. Let's average the latitude of cities in the western hemisphere:

    >>> import Data.Series ( mean )
    >>> let latitudes = Series.map latitude western_cities
    >>> Series.fold mean latitudes
    3.05
    

    The next section introduces fold more generally.

    Folding

    Folding refers to the action of aggregating values in a Series to a single value. Folding Series is done through the fold function. Its type signature is:

    >>> :t Series.fold
    Series.fold :: Fold a b -> Series k a -> b
    

    Here, Fold a b represents a calculation which takes in values of type a, and will ultimately produce a final value of type b. Such calculations are provided by the foldl package (see Foldl), although some of its functions are re-exported by Data.Series (and Data.Series.Unboxed), such as mean.

    Let's look at an example. First, we'll need some data. We'll use end-of-day stock prices for Apple Inc:

    >>> import Data.Fixed ( Centi )
    >>> (aapl_closing :: Series String Double) <- (Series.fromList . read) <$> readFile "files/aapl.txt"
    >>> aapl_closing
           index |  values
           ----- |  ------
    "1980-12-12" |  0.1007
    "1980-12-15" | 9.54e-2
    "1980-12-16" | 8.84e-2
             ... |     ...
    "2022-01-05" |  174.92
    "2022-01-06" |   172.0
    "2022-01-07" |  172.17
    

    Normally we would use an appropriate datetime type for the index of aapl_closing, for example from the time package, but we're keeping it simple for this tutorial.

    Prices have changed a lot over the years, so we'll restrict ourselves to 2021:

    >>> let aapl_closing_2021 = aapl_closing `select` "2021-01-01" `to` "2021-12-31"
    >>> aapl_closing_2021
           index |   values
           ----- |   ------
    "2021-01-04" | 128.6174
    "2021-01-05" | 130.2076
    "2021-01-06" | 125.8246
             ... |      ...
    "2021-12-29" |   179.38
    "2021-12-30" |    178.2
    "2021-12-31" |   177.57
    

    To calculate the average closing price over the year 2021, we use fold in conjunction with mean:

    >>> Series.fold Series.mean aapl_closing_2021
    140.61256349206354
    

    One of the magic things about Fold is that it's possible to combine them in such a way that you can traverse a Series only once, which is important for good performance. As an example, we'll calculate both the mean closing price AND the standard deviation of closing prices.

    >>> let meanAndStdDev = (,) <$> Data.Series.mean <*> Data.Series.std
    >>> Series.fold meanAndStdDev aapl_closing_2021
    (140.61256349206354,14.811663837435361)
    

    See Foldl from the foldl package for more information on Fold.

    Grouping

    One important feature of Series is the ability to efficiently group values together based on their keys.

    Let's load some stock price data again for this part:

    >>> import Data.Fixed ( Centi )
    >>> (aapl_closing :: Series String Double) <- (Series.fromList . read) <$> readFile "files/aapl.txt"
    >>> aapl_closing
           index |  values
           ----- |  ------
    "1980-12-12" |  0.1007
    "1980-12-15" | 9.54e-2
    "1980-12-16" | 8.84e-2
             ... |     ...
    "2022-01-05" |  174.92
    "2022-01-06" |   172.0
    "2022-01-07" |  172.17
    

    Grouping involves two steps:

    1. Grouping keys in some way using groupBy;
    2. Aggregating the values in each group using aggregateWith or other variants.

    Let's find the highest closing price of each month. First, we need to define our grouping function:

    >>> :{
           -- | Extract the year and month from a date like XXXX-YY-ZZ. For example:
           -- 
           -- >>> month "2023-01-01"
           -- "2023-01"
           month :: String -> String
           month = take 7
        :}
    

    Then, we can group keys by month and take the maximum of each group:

    >>> aapl_closing `groupBy` month `aggregateWith` maximum
        index | values
        ----- | ------
    "1980-12" | 0.1261
    "1981-01" | 0.1208
    "1981-02" | 0.1007
          ... |    ...
    "2021-11" |  165.3
    "2021-12" | 180.33
    "2022-01" | 182.01
    

    This means, for example, that the maximum closing price for Apple stock in the month of November 2021 was $165.30 per share. This library also contains numerical aggregation functions such as mean and std. Therefore, in order to find the monthly average Apple closing price, rounded to the nearest cent:

    >>> import Data.Series (mean)
    >>> let (roundToCent :: Double -> Double) = \x -> fromIntegral ((round $ x * 100) :: Int) / 100
    >>> aapl_closing `groupBy` month `aggregateWith` (roundToCent . Series.fold mean)
        index | values
        ----- | ------
    "1980-12" |   0.11
    "1981-01" |   0.11
    "1981-02" | 9.0e-2
          ... |    ...
    "2021-11" | 154.21
    "2021-12" | 173.55
    "2022-01" | 176.16
    

    Window aggregation

    Windowing aggregation refers to the practice of aggregating values in a window around every key.

    General-purpose windowing is done using the windowing function. Let's look at its type signature:

    >>> :t windowing
    windowing
      :: Ord k =>
         (k -> Range k) -> (Series k a -> b) -> Series k a -> Series k b
    

    Here, windowing window aggfunc xs is a new series Series k b where for every key k, the values in the range window k are aggregated by aggfunc and placed in the resulting series at key k. Here's an example where for every key k, we add the values at k and k+1:

    >>> :{
    let (xs :: Series Int Int) 
          = Series.fromList [ (1, 0)
                            , (2, 1)
                            , (3, 2)
                            , (4, 3)
                            , (5, 4)
                            , (6, 5)
                            ]
    in windowing (\k -> k `to` (k + 1)) sum xs
    :}
    index | values
    ----- | ------
        1 |      1
        2 |      3
        3 |      5
        4 |      7
        5 |      9
        6 |      5
    

    windowing can be used to compute so-called rolling aggregations. An example of this is to compute the rolling mean of the last 3 keys:

    >>> import Data.Series ( mean )
    >>> :{
    let rollingMean = windowing (\k -> (k-3) `to` k) (Series.fold mean)
        (xs :: Series Int Double) 
          = Series.fromList [ (1, 0)
                            , (2, 1)
                            , (3, 2)
                            , (4, 3)
                            , (5, 4)
                            , (6, 5)
                            ]
     in (rollingMean xs) :: Series Int Double
    :}
    index | values
    ----- | ------
        1 |    0.0
        2 |    0.5
        3 |    1.0
        4 |    1.5
        5 |    2.5
        6 |    3.5
    

    Combining Series together

    An important class of operations are combining two Series together, also known as *zipping*. For lists, Haskell has zipWith. Series also have zipWith and variants:

    • zipWith, which combines two series with some elementwise function;
    • zipWithMatched, which combines two series with some elementwise function on keys which are in *both* maps;
    • zipWithStrategy, which combines two series with some elementwise function and supports custom operations to deal with missing keys;

    To illustrate the differences between the various zipping functions, consider the following two series. There's population:

    >>> :set -XNumericUnderscores
    >>> import Data.Fixed (Centi)
    >>> :{
        -- Most recent population estimate rounded to the nearest million
        let population = Series.fromList [ ("Canada"::String, 40_000_000::Centi)
                                         , ("Kenya"         , 56_000_000)
                                         , ("Poland"        , 38_000_000)
                                         , ("Singapore"     ,  6_000_000)
                                         ]
        :}
    

    and there's total land mass:

    >>> :{
        -- Land mass in square kilometer
        let landmass = Series.fromList [ ("Brazil"::String, 8_520_000::Centi)
                                       , ("Canada",         9_990_000)
                                       , ("Kenya",            580_000)
                                       , ("Poland",           313_000)
                                       ] 
        :}
    

    zipWith f left right combines the series left and right using the function f which admits two arguments, for all keys one-by-one. If a key is missing from either left or right, zipWith returns Nothing. For example, the population density per country would be:

    >>> Series.zipWith (/) population landmass
          index |      values
          ----- |      ------
       "Brazil" |     Nothing
       "Canada" |   Just 4.00
        "Kenya" |  Just 96.55
       "Poland" | Just 121.40
    "Singapore" |     Nothing
    

    Since we don't have population estimates for Brazil and no land mass information for Singapore, we can't calculate their population densities.

    Sometimes, we only care about the results of zipWith f where keys are in both series. In this case, we can use zipWithMatched:

    >>> Series.zipWithMatched (/) population landmass
       index | values
       ----- | ------
    "Canada" |   4.00
     "Kenya" |  96.55
    "Poland" | 121.40
    

    Finally, in case we want full control over what to do when a key is missing, we can use @Series.zipWithStrategy'. For example, consider the case where:

    • If population numbers are missing, I want to set the density to 0;
    • If land mass information is missing, I wait to skip calculating the density of this country.
    >>> import Data.Series (skipStrategy, constStrategy)
    >>> let noPopulationStrategy = Series.constStrategy 0
    >>> let noLandmassStrategy   = Series.skipStrategy
    >>> Series.zipWithStrategy (/) noPopulationStrategy noLandmassStrategy population landmass
          index | values
          ----- | ------
       "Canada" |   4.00
        "Kenya" |  96.55
       "Poland" | 121.40
    "Singapore" |   0.00
    

    As you can imagine, zipWithStrategy is the most general and gives the most control, but is less easy to use than zipWith and zipWithMatched.

    Conclusion

    This section concludes the introductory tutorial to the javelin package and its Data.Series module.

    For a more in-depth look at this package, you can read the full documentation for each module:

    Advanced topics

    Handling duplicate keys

    If you must build a Series with duplicate keys, you can use the fromListDuplicates or fromVectorDuplicates functions. In the example below, the key 'd' is repeated three times:

    >>> Series.fromListDuplicates [('b', 0::Int), ('a', 5), ('d', 1), ('d', -4), ('d', 7) ]
      index | values
      ----- | ------
    ('a',0) |      5
    ('b',0) |      0
    ('d',0) |      1
    ('d',1) |     -4
    ('d',2) |      7
    

    Note that the Series produced by fromListDuplicates still has unique keys, but each key is a composite of a character and an occurrence. This is reflected in the type:

    >>> :t Series.fromListDuplicates [('b', 0::Int), ('a', 5), ('d', 1), ('d', -4), ('d', 7) ]
    Series.fromListDuplicates [('b', 0::Int), ('a', 5), ('d', 1), ('d', -4), ('d', 7) ]
      :: Series (Char, Occurrence) Int
    

    Here, Occurrence is a non-negative number, and can be converted to other integer-like numbers using fromIntegral. In practice, you should aim to aggregate your Series to remove duplicate keys, for example using groupBy and grouping on the first element of the key (fst):

    >>> let xs = Series.fromListDuplicates [('b', 0::Int), ('a', 5), ('d', 1), ('d', -4), ('d', 7) ]
    >>> xs `groupBy` fst `aggregateWith` sum
    index | values
    ----- | ------
      'a' |      5
      'b' |      0
      'd' |      4
    

    Unboxed and generic series

    The Series defined in Data.Series are based on Vector from Data.Vector. This implementation is nice because such Series can hold _any_ Haskell type. However, because Haskell types can be arbitrarily complex, numerical operations on Series may not be as fast as could be.

    For simpler types such as Double and Int, a different kind of series can be used to speed up numerical calculations: Series from the Data.Series.Unboxed module. Such Series are much more limited: they can only contain datatypes which are instances of Unbox.

    This then brings the question: how can you write software which supports both ordinary Series and unboxed Series? The answer is to use functions from the Data.Series.Generic.

    For example, we could implement the dot product of two series as:

    >>> import qualified Data.Series.Generic as G
    >>> import Data.Vector.Generic ( Vector )
    >>> :{
          dot :: (Ord k, Num a, Vector v a) => G.Series v k a -> G.Series v k a -> a
          dot v1 v2 = G.sum $ G.zipWithMatched (*) v1 v2
        :}
    

    You can convert between the two types of series using the convert function.

    Replacing values

    map allows to map every value of a series. How about replacing *some* values in a series? The function replace (and its infix variant, |->) replaces values in the right operand which have an analogue in the left operand:

    >>> import Data.Series ( (|->) )
    >>> let nan = (0/0) :: Double
    >>> let right = Series.fromList [('a', 1), ('b', nan), ('c', 3), ('d', nan)]
    >>> right
    index | values
    ----- | ------
      'a' |    1.0
      'b' |    NaN
      'c' |    3.0
      'd' |    NaN
    >>> let left = Series.fromList [('b', 0::Double), ('d', 0), ('e', 0)]
    >>> left
    index | values
    ----- | ------
      'b' |    0.0
      'd' |    0.0
      'e' |    0.0
    >>> left |-> right
    index | values
    ----- | ------
      'a' |    1.0
      'b' |    0.0
      'c' |    3.0
      'd' |    0.0
    

    In the example above, the key 'e' is ignored since it was not in the right series to begin with.

    The flipped version, <-|, is also available.

    Comparison with other data structures

    Below is a table showing which operations on Data.Series have analogues for other data structures.

    ActionData.SeriesData.Map.StrictData.ListData.Vector
    Mapping valuesmapmapmapmap
    Mapping indexmapIndexmapKeys
    Mapping values with keymapWithKeymapWithKey
    Filtering valuesfilterfilterfilterfilter
    Filtering index select, filterWithKeyfilterWithKey
    Indexing by keyatlookup
    Indexing by positioniat!!
    Combine two structures key-wisezipWithmerge
    Union<>unionunion
    Group keysgroupBy