# tasty-bench

Featherlight benchmark framework (only one file!) for performance measurement with an API mimicking [`criterion`](http://hackage.haskell.org/package/criterion) and [`gauge`](http://hackage.haskell.org/package/gauge).

## How lightweight is it?

There is only one source file `Test.Tasty.Bench` and no external dependencies except [`tasty`](http://hackage.haskell.org/package/tasty). So if you already depend on `tasty` for a test suite, there is nothing else to install.

Compare this to `criterion` (10+ modules, 50+ dependencies) and `gauge` (40+ modules, depends on `basement` and `vector`).

## How is it possible?

Our benchmarks are literally regular `tasty` tests, so we can leverage all existing machinery for command-line options, resource management, structuring, listing and filtering benchmarks, running and reporting results. It also means that `tasty-bench` can be used in conjunction with other `tasty` ingredients.

Unlike `criterion` and `gauge`, we use a very simple statistical model, described below. This is arguably a questionable choice, but it works pretty well in practice. Few developers are sufficiently well-versed in probability theory to make sense of, and use, all the numbers generated by `criterion`.

## How to switch?

[Cabal mixins](https://cabal.readthedocs.io/en/3.4/cabal-package.html#pkg-field-mixins) allow you to taste `tasty-bench` instead of `criterion` or `gauge` without changing a single line of code:

```cabal
cabal-version: 2.0

benchmark foo
  ...
  build-depends:
    tasty-bench
  mixins:
    tasty-bench (Test.Tasty.Bench as Criterion)
```

This works vice versa as well: if you use `tasty-bench`, but at some point need a more comprehensive statistical analysis, it is easy to switch temporarily back to `criterion`.

## How to write a benchmark?

Benchmarks are declared in a separate section of the `cabal` file:

```cabal
cabal-version: 2.0
name:          bench-fibo
version:       0.0
build-type:    Simple
synopsis:      Example of a benchmark

benchmark bench-fibo
  main-is:       BenchFibo.hs
  type:          exitcode-stdio-1.0
  build-depends: base, tasty-bench
```

And here is `BenchFibo.hs`:

```haskell
import Test.Tasty.Bench

fibo :: Int -> Integer
fibo n = if n < 2 then toInteger n else fibo (n - 1) + fibo (n - 2)

main :: IO ()
main = defaultMain
  [ bgroup "fibonacci numbers"
    [ bench "fifth"     $ nf fibo  5
    , bench "tenth"     $ nf fibo 10
    , bench "twentieth" $ nf fibo 20
    ]
  ]
```

Since `tasty-bench` provides an API compatible with `criterion`, one can refer to [its documentation](http://www.serpentine.com/criterion/tutorial.html#how-to-write-a-benchmark-suite) for more examples.
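For instance, the familiar `criterion`-style combinators beyond `nf` are mirrored as well: `whnf` evaluates a result only to weak head normal form, while `nfIO` runs an `IO` action and forces its result. The following is a minimal sketch; the benchmarked expressions and the file path are arbitrary illustrations, not part of the example above:

```haskell
import Test.Tasty.Bench

main :: IO ()
main = defaultMain
  [ -- whnf evaluates the result to weak head normal form only,
    -- which is sufficient for a strict type such as Int
    bench "sum" $ whnf (sum :: [Int] -> Int) [1 .. 1000]
    -- nfIO runs an IO action and forces its result to normal form
  , bench "read file" $ nfIO (readFile "/etc/hosts")
  ]
```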
## How to read results?

Running the fibonacci example above (`cabal bench` or `stack bench`) results in the following output:

```
All
  fibonacci numbers
    fifth:     OK (2.13s)
      63 ns ± 3.4 ns
    tenth:     OK (1.71s)
      809 ns ± 73 ns
    twentieth: OK (3.39s)
      104 μs ± 4.9 μs

All 3 tests passed (7.25s)
```

The output says that, for instance, the first benchmark was repeatedly executed for 2.13 seconds (wall time), its mean time was 63 nanoseconds and, assuming ideal precision of the system clock, execution time does not often diverge from the mean further than ±3.4 nanoseconds (double standard deviation, which for normal distributions corresponds to [95%](https://en.wikipedia.org/wiki/68%E2%80%9395%E2%80%9399.7_rule) probability). Take standard deviation numbers with a grain of salt; there are lies, damned lies, and statistics.

Note that this data is not directly comparable with `criterion` output:

```
benchmarking fibonacci numbers/fifth
time                 62.78 ns   (61.99 ns .. 63.41 ns)
                     0.999 R²   (0.999 R² .. 1.000 R²)
mean                 62.39 ns   (61.93 ns .. 62.94 ns)
std dev              1.753 ns   (1.427 ns .. 2.258 ns)
```

One might interpret the second line as saying that 95% of measurements fell into the 61.99–63.41 ns interval, but this is wrong. It states that the [OLS regression](https://en.wikipedia.org/wiki/Ordinary_least_squares) of execution time (which is not exactly the mean time) most probably lies somewhere between 61.99 ns and 63.41 ns, but it says nothing about individual measurements. To understand how far a typical measurement deviates, you need to add/subtract double the standard deviation yourself (which gives 62.78 ns ± 3.506 ns, similar to the `tasty-bench` output above).

To add to the confusion, `gauge` in `--small` mode outputs not the second line of the `criterion` report, as one might expect, but the mean value from the penultimate line together with the standard deviation:

```
fibonacci numbers/fifth                  mean 62.39 ns  ( +- 1.753 ns  )
```

The interval ±1.753 ns covers only [68%](https://en.wikipedia.org/wiki/68%E2%80%9395%E2%80%9399.7_rule) of samples; double it to estimate the behavior in 95% of cases.

## Statistical model

Here is the procedure used by `tasty-bench` to measure execution time:

1. Set _n_ ← 1.
2. Measure execution time _tₙ_ of _n_ iterations and execution time _t₂ₙ_ of _2n_ iterations.
3. Find _t_ which minimizes the deviation of (_nt_, _2nt_) from (_tₙ_, _t₂ₙ_).
4. If the deviation is small enough (see `--stdev` below), return _t_ as the mean execution time.
5. Otherwise set _n_ ← _2n_ and jump back to Step 2.

This is roughly similar to the linear regression approach which `criterion` takes, but we fit only the last two points. This allows us to simplify away all heavy-weight statistical analysis. More importantly, earlier measurements, which are presumably shorter and noisier, do not affect the overall result. This is in contrast to `criterion`, which fits all measurements and is biased towards the more numerous data points corresponding to shorter runs (it employs an _n_ ← _1.05n_ progression).
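For illustration only, here is a rough sketch of Steps 2–5 in plain Haskell. This is not the actual `tasty-bench` implementation (which, among other differences, measures wall time rather than CPU time); `timeBatch` and `measureUntil` are hypothetical names used solely for this sketch:

```haskell
import Control.Monad (replicateM_)
import System.CPUTime (getCPUTime)

-- Time a batch of k iterations of an action, in seconds.
-- (CPU time is merely the simplest self-contained stand-in here.)
timeBatch :: Int -> IO () -> IO Double
timeBatch k act = do
  start <- getCPUTime
  replicateM_ k act
  end <- getCPUTime
  pure (fromIntegral (end - start) * 1e-12)  -- picoseconds to seconds

-- Keep doubling the batch size until the fitted per-iteration time t
-- is stable within the target relative standard deviation
-- (e.g. 0.05 corresponds to the default --stdev 5).
measureUntil :: Double -> IO () -> IO Double
measureUntil targetRelStdev act = go 1
  where
    go n = do
      t1 <- timeBatch n act
      t2 <- timeBatch (2 * n) act
      let n' = fromIntegral n
          -- least-squares fit of (n * t, 2 * n * t) against (t1, t2)
          t = (t1 + 2 * t2) / (5 * n')
          -- residual deviation, scaled back to a single iteration
          dev = sqrt (((n' * t - t1) ^ 2 + (2 * n' * t - t2) ^ 2) / 2) / n'
      if dev < targetRelStdev * t
        then pure t
        else go (2 * n)
```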
An alert reader could object that we measure standard deviation for samples with _n_ and _2n_ iterations, but report it scaled to a single iteration. Strictly speaking, this is justified only if we assume that deviating factors are either roughly periodic (e. g., coarseness of a system clock, garbage collection) or are likely to affect several successive iterations in the same way (e. g., a slowdown caused by another concurrent process).

Obligatory disclaimer: statistics is a tricky matter, and there is no one-size-fits-all approach. In the absence of a good theory, simplistic approaches are as (un)sound as obscure ones. Those who seek statistical soundness should rather collect raw data and process it themselves in R/Python. Data reported by `tasty-bench` is only of indicative and comparative significance.

## Tip

Passing `+RTS -T` (via `cabal bench --benchmark-options '+RTS -T'` or `stack bench --ba '+RTS -T'`) enables `tasty-bench` to estimate and report memory usage, such as allocated and copied bytes.

## Command-line options

Use `--help` to list command-line options.

* `-p`, `--pattern`

  This is a standard `tasty` option, which allows filtering benchmarks by a pattern or an `awk` expression. Please refer to the [`tasty` documentation](https://github.com/feuerbach/tasty#patterns) for details.

* `--csv`

  File to write results to in CSV format. If specified, suppresses console output.

* `-t`, `--timeout`

  This is a standard `tasty` option, setting a timeout for individual benchmarks in seconds. Use it when benchmarks tend to take too long: `tasty-bench` will make an effort to report results (even if of subpar quality) before the timeout. Setting the timeout too tight (insufficient for at least three iterations of a benchmark) will result in a benchmark failure. Do not use `--timeout` without a reason: it forks an additional thread and thus affects the reliability of measurements. (A timeout can also be set for a group of benchmarks in code; see the sketch after this list.)

* `--stdev`

  Target relative standard deviation of measurements in percents (5% by default). Large values correspond to fast and loose benchmarks, and small ones to long and precise ones. If it takes far too long, consider setting `--timeout`, which will interrupt benchmarks, potentially before reaching the target deviation.
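Because benchmarks are regular `tasty` test trees, such options can also be set for a subtree directly in code. The sketch below uses `localOption` and `mkTimeout` from `Test.Tasty` (which would require `tasty` in `build-depends`); it illustrates standard `tasty` facilities rather than anything specific to `tasty-bench`:

```haskell
import Test.Tasty (localOption, mkTimeout)
import Test.Tasty.Bench

main :: IO ()
main = defaultMain
  [ localOption (mkTimeout 10000000) $
      -- limit every benchmark in this group to 10 seconds;
      -- mkTimeout takes microseconds
      bgroup "potentially slow"
        [ bench "factorial" $ nf (\n -> product [1 .. n :: Integer]) 1000
        ]
  ]
```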