{-# OPTIONS_GHC -O2 -Wall #-}
{-# OPTIONS_GHC -fno-warn-unused-imports #-}

{-| This library provides a way to train a model
    that predicts the "randomness" of an input @'ByteString'@,
    and two datatypes to facilitate this:

    @'FreqTrain'@ is a datatype that can be constructed via
    training functions that take @'ByteString'@s as input, and
    can be used with the @'measure'@ function to gather an
    estimate of the aforementioned probability of "randomness".

    @'Freq'@ is a datatype that is constructed by calling the @'tabulate'@
    function on a @'FreqTrain'@. @'Freq'@s are meant solely for using (accessing
    the "randomness" values) the trained model in practise, by making
    significant increases to speed in exchange for less extensibility;
    you can neither make a change to a @'Freq'@ or convert it back to
    a @'FreqTrain'@. In practise this however proves to not be a problem,
    because training usually only happens once.

    Laws:
    
    @ 'measure' (f :: 'FreqTrain') b ≡ 'measure' ('tabulate' f) b @

    
    Below is a simple illustration of how to use this library.
    We are going to write a small command-line application that
    trains on some data, and scores @'ByteString'@s according to how
    random they are. We will say that a @'ByteString'@ is 'random'
    if it scores less than 0.05 (on a scale of 0 to 1), and not random
    otherwise.
    
    First, some imports:
  
  @ 
  import Freq
  import Control.Monad (forever)
  
  import qualified Data.ByteString.Char8 as BC
  @
  
    Next, a list of @'FilePath'@s containing training data.
    The training data here is the same as is provided in
    the sample executable of this library. It consists solely
    of books in the Public Domain.
  
  @ 
  trainTexts :: [FilePath]
  trainText
    = fmap (\x -> "txtdocs/" ++ x ".txt")
      -- ^ this line just tells us that all
      --   of the training data is in the 'txtdocs'
      --   directory, and has a '.txt' file extension.
        [ "2000010"
        , "2city10"
        , "80day10"
        , "alcott-little-261"
        , "byron-don-315"
        , "carol10"
        , "center_earth"
        , "defoe-robinson-103"
        , "dracula"
        , "freck10"
        , "invisman"
        , "kipling-jungle-148"
        , "lesms10"
        , "london-call-203"
        , "london-sea-206"
        , "longfellow-paul-210"
        , "madambov"
        , "monroe-d"
        , "moon10"
        , "ozland10"
        , "plgrm10"
        , "sawy210"
        , "speckldb"
        , "swift-modest-171"
        , "time_machine"
        , "war_peace"
        , "white_fang"
        , "zenda10"
        ]
  @

    We are going to use a function provided by this library
    called @'trainWithMany'@. Its type signature is:

  @
  trainWithMany
    :: Foldable t
    => t FilePath   -- ^ FilePaths containing training data
    -> IO FreqTrain -- ^ Frequency table generated as a result of training, inside of 'IO'
  @
    
    In other words, @'trainWithMany'@ takes a bunch of files,
    trains a model with all of the training data contained therein,
    and returns a @'FreqTrain'@ inside of @'IO'@.

    And now, we get freaky:

  @

  -- | "passes" returns a message letting the user know whether
  --   or not their input 'ByteString' was most likely random.
  --   Recall that our threshold is 0.05 on a scale of 0 to 1.
  passes :: Double -> String
  passes x
    | x < 0.05  = "Too random!"
    | otherwise = "Looks good to me!"

  main :: IO ()
  main = do
    !freak <- trainWithMany trainTexts
    -- ^ create the trained model
    
    let !freakTable = tabulate freak
    -- ^ optimise the trained model for
    --   read access
    
    putStrLn "Done loading frequencies."
    -- ^ let the user known that our model
    --   is done training and has finished
    --   optimising into a 'Freq'
    
    forever $ do
    -- ^ make the following loop forever 
      
      putStrLn "Enter text:"
      -- ^ ask the user for some text
      
      !bs <- BC.getLine
      -- ^ bs is the input 'ByteString' to score
      
      let !score = measure freakTable bs
      -- ^ score of the 'ByteString'!
      
      putStrLn $ "Score: " ++ show score ++ "\n"
        ++ passes score
      -- ^ print out what the score of the 'ByteString' was,
      --   along with its 'passing status'.
  @

    This results in the following interactions, split up for readability:

  >>> Done loading frequencies.
  >>> Enter text:
  >>> freq
  >>> Score: 0.10314131395591991
  >>> Looks good to me!
  
  >>> Enter text:
  >>> kjdslfkajdslkfjsd
  >>> Score: 6.693203041828383e-3
  >>> Too random!
  
  >>> Enter text:
  >>> William
  >>> Score: 7.086442245879888e-2
  >>> Looks good to me!

  >>> Enter text:
  >>> 8op3u92jf
  >>> Score: 6.687182330334067e-3
  >>> Too random!
    
    As we can see, it rejects the keysmashed text as being too random,
    while the human-readable text is A-OK. I actually made the threshold
    of 0.05 too high - it should be somewhere between 0.01 and 0.03, but
    even then the outcomes would have still been the same. The digram-based
    approach that 'freq' uses may seem ridiculously naive, but still
    maintains a high degree of accuracy.

    As an example of a real-world use case, I wrote 'freq' to use at my
    workplace (I work at a Network Security company) as a way to score
    TLDs (top-level domains) according to how random they are. Malicious
    users spin up fake domains frequently using strings of random characters
    as the TLD. This can also be used to score Windows executables, since
    those follow the same pattern of malicious naming.

    An obvious weakness of this library is that it suffers from what can
    be referred to as the "xkcd problem". It can score things such as 'xkcd'
    poorly, even though they are perfectly legitimate TLDs. The fix I use is
    to use something like the alexa top 1 million list of TLDs, along with a
    HashMap(s) for whitelisting/blacklisting.

    As a wise man once told me - "And then I freaked it."
-}

module Freq
  ( -- * Frequency table builder (trainer) type
    FreqTrain
    
    -- * Construction
  , empty 
  , singleton

    -- * Training
  , train 
  , trainWith
  , trainWithMany

    -- * Using a trained model
  , tabulate
  , Freq
  , measure
  , prob
  
    -- * Pretty Printing
  , prettyFreqTrain
  ) where

import Data.ByteString (ByteString)
import Freq.Internal