freq-0.1.1: Are you ready to get freaky?

Safe HaskellNone
LanguageHaskell2010

Freq

Contents

Description

This library provides a way to train a model that predicts the "randomness" of an input ByteString, and two datatypes to facilitate this:

FreqTrain is a datatype that can be constructed via training functions that take ByteStrings as input, and can be used with the measure function to gather an estimate of the aforementioned probability of "randomness".

Freq is a datatype that is constructed by calling the tabulate function on a FreqTrain. Freqs are meant solely for using (accessing the "randomness" values) the trained model in practise, by making significant increases to speed in exchange for less extensibility; you can neither make a change to a Freq or convert it back to a FreqTrain. In practise this however proves to not be a problem, because training usually only happens once.

Laws:

measure (f :: FreqTrain) b ≡ measure (tabulate f) b

Below is a simple illustration of how to use this library. We are going to write a small command-line application that trains on some data, and scores ByteStrings according to how random they are. We will say that a ByteString is random if it scores less than 0.05 (on a scale of 0 to 1), and not random otherwise.

First, some imports:

import Freq
import Control.Monad (forever)
 
import qualified Data.ByteString.Char8 as BC

Next, a list of FilePaths containing training data. The training data here is the same as is provided in the sample executable of this library. It consists solely of books in the Public Domain.

trainTexts :: [FilePath]
trainText
  = fmap (x -> "txtdocs/" ++ x ++ ".txt")
    -- ^
    -- | this line just tells us that all
    --   of the training data is in the txtdocs
    --   directory, and has a '.txt' file extension.

    -- | These are the text files from which we wish to train.
    -- v
      [ "2000010"
      , "2city10"
      , "80day10"
      , "alcott-little-261"
      , "byron-don-315"
      , "carol10"
      , "center_earth"
      , "defoe-robinson-103"
      , "dracula"
      , "freck10"
      , "invisman"
      , "kipling-jungle-148"
      , "lesms10"
      , "london-call-203"
      , "london-sea-206"
      , "longfellow-paul-210"
      , "madambov"
      , "monroe-d"
      , "moon10"
      , "ozland10"
      , "plgrm10"
      , "sawy210"
      , "speckldb"
      , "swift-modest-171"
      , "time_machine"
      , "war_peace"
      , "white_fang"
      , "zenda10"
      ]

We are going to use a function provided by this library called trainWithMany. Its type signature is:

trainWithMany
  :: Foldable t
  => t FilePath   -- ^ FilePaths containing training data
  -> IO FreqTrain -- ^ Frequency table generated as a result of training, inside of IO

In other words, trainWithMany takes a bunch of files, trains a model with all of the training data contained therein, and returns a FreqTrain inside of IO.

And now, we get freaky:

-- | "passes" returns a message letting the user know whether
--   or not their input ByteString was most likely random.
--   Recall that our threshold is 0.05 on a scale of 0 to 1.
passes :: Double -> String
passes x
  | x < 0.05  = "Too random!"
  | otherwise = "Looks good to me!"

main :: IO ()
main = do
  !freak <- trainWithMany trainTexts
  -- ^
  -- | create the trained model
  -- | Note that we do this strictly,
  -- | so that the model is ready to
  -- | go when we intuitively expect it
  -- | to be.
  
  let !freakTable = tabulate freak
  -- ^
  -- | optimise the trained model for
  --   read access
   
  putStrLn "Done loading frequencies."
  -- ^
  -- | let the user know that our model
  --   is done training and has finished
  --   optimising into a Freq
  
  forever $ do
  -- ^
  -- | make the following code loop forever 
    
    putStrLn "Enter text:"
    -- ^
    -- | ask the user for some text
    
    !bs <- BC.getLine
    -- ^
    -- | bs is the input ByteString to score
    
    let !score = measure freakTable bs
    -- ^
    -- | score of the ByteString!
    
    putStrLn $ "Score: " ++ show score ++ "n"
      ++ passes score
    -- ^  
    -- | print out what the score of the ByteString was,
    --   along with its 'passing status'.

This results in the following interactions, split up for readability:

>>> Done loading frequencies.
>>> Enter text:
>>> freq
>>> Score: 0.10314131395591991
>>> Looks good to me!
>>> Enter text:
>>> kjdslfkajdslkfjsd
>>> Score: 6.693203041828383e-3
>>> Too random!
>>> Enter text:
>>> William
>>> Score: 7.086442245879888e-2
>>> Looks good to me!
>>> Enter text:
>>> 8op3u92jf
>>> Score: 6.687182330334067e-3
>>> Too random!

As we can see, it rejects the keysmashed text as being too random, while the human-readable text is A-OK. I actually made the threshold of 0.05 too high - it should be somewhere between 0.01 and 0.03, but even then the outcomes would have still been the same. The digram-based approach that freq uses may seem ridiculously naive, but still maintains a high degree of accuracy.

As an example of a real-world use case, I wrote freq to use at my workplace (I work at a Network Security company) as a way to score domain names according to how random they are. Malicious users spin up fake domains frequently using strings of random characters. This can also be used to score Windows executables, since those follow the same pattern of malicious naming.

An obvious weakness of this library is that it suffers from what can be referred to as the "xkcd problem". It can score things such as xkcd poorly, even though they are perfectly legitimate domains. The fix I use is to use something like the alexa top 1 million list of domains, along with a HashMap(s) for whitelisting/blacklisting.

As a wise man once told me - "And then I freaked it."

Synopsis

Frequency table builder (trainer) type

data FreqTrain Source #

A FreqTrain is a digram-based frequency table.

One can construct a FreqTrain with train, trainWith, or trainWithMany.

One can use a trained FreqTrain with prob and measure.

mappend == <> will add the values of each of the matching keys.

It is highly recommended to convert a FreqTrain to a Freq with tabulate before using the trained model, because Freqs have O(1) reads as well as significantly faster constant-time operations, however keep in mind that Freqs cannot be neither modified nor converted back to a FreqTrain.

Instances
Eq FreqTrain Source # 
Instance details

Defined in Freq.Internal

Data FreqTrain Source # 
Instance details

Defined in Freq.Internal

Methods

gfoldl :: (forall d b. Data d => c (d -> b) -> d -> c b) -> (forall g. g -> c g) -> FreqTrain -> c FreqTrain #

gunfold :: (forall b r. Data b => c (b -> r) -> c r) -> (forall r. r -> c r) -> Constr -> c FreqTrain #

toConstr :: FreqTrain -> Constr #

dataTypeOf :: FreqTrain -> DataType #

dataCast1 :: Typeable t => (forall d. Data d => c (t d)) -> Maybe (c FreqTrain) #

dataCast2 :: Typeable t => (forall d e. (Data d, Data e) => c (t d e)) -> Maybe (c FreqTrain) #

gmapT :: (forall b. Data b => b -> b) -> FreqTrain -> FreqTrain #

gmapQl :: (r -> r' -> r) -> r -> (forall d. Data d => d -> r') -> FreqTrain -> r #

gmapQr :: (r' -> r -> r) -> r -> (forall d. Data d => d -> r') -> FreqTrain -> r #

gmapQ :: (forall d. Data d => d -> u) -> FreqTrain -> [u] #

gmapQi :: Int -> (forall d. Data d => d -> u) -> FreqTrain -> u #

gmapM :: Monad m => (forall d. Data d => d -> m d) -> FreqTrain -> m FreqTrain #

gmapMp :: MonadPlus m => (forall d. Data d => d -> m d) -> FreqTrain -> m FreqTrain #

gmapMo :: MonadPlus m => (forall d. Data d => d -> m d) -> FreqTrain -> m FreqTrain #

Ord FreqTrain Source # 
Instance details

Defined in Freq.Internal

Read FreqTrain Source # 
Instance details

Defined in Freq.Internal

Show FreqTrain Source # 
Instance details

Defined in Freq.Internal

Semigroup FreqTrain Source # 
Instance details

Defined in Freq.Internal

Monoid FreqTrain Source # 
Instance details

Defined in Freq.Internal

NFData FreqTrain Source # 
Instance details

Defined in Freq.Internal

Methods

rnf :: FreqTrain -> () #

Freaky FreqTrain Source # 
Instance details

Defined in Freq.Internal

Methods

prob :: FreqTrain -> Word8 -> Word8 -> Double Source #

Construction

empty :: FreqTrain Source #

O(1). The empty frequency table.

singleton Source #

Arguments

:: Word8

Outer key

-> Word8

Inner key

-> Double

Weight

-> FreqTrain

The singleton frequency table

O(1). A Frequency table with a single entry.

Training

train :: ByteString -> FreqTrain Source #

Given a ByteString consisting of training data, build a Frequency table.

trainWith Source #

Arguments

:: FilePath

FilePath containing training data

-> IO FreqTrain

Frequency table generated as a result of training, inside of IO.

Given a FilePath containing training data, build a Frequency table inside of the IO monad.

trainWithMany Source #

Arguments

:: Foldable t 
=> t FilePath

FilePaths containing training data

-> IO FreqTrain

Frequency table generated as a result of training, inside of IO.

Given a list of FilePath containing training data, build a Frequency table inside of the IO monad.

Using a trained model

tabulate :: FreqTrain -> Freq Source #

Optimise a FreqTrain for O(1) read access.

data Freq Source #

A variant of FreqTrain that holds identical information but is optimised for reads. There are no operations that imbue a Freq with additional information.

Reading from a Freq is orders of magnitude faster than reading from a FreqTrain. It is highly recommended that you use your trained model by first converting a FreqTrain to a Freq with tabulate.

Instances
Eq Freq Source # 
Instance details

Defined in Freq.Internal

Methods

(==) :: Freq -> Freq -> Bool #

(/=) :: Freq -> Freq -> Bool #

Show Freq Source #

This exists for debugging purposes

Instance details

Defined in Freq.Internal

Methods

showsPrec :: Int -> Freq -> ShowS #

show :: Freq -> String #

showList :: [Freq] -> ShowS #

Binary Freq Source # 
Instance details

Defined in Freq.Internal

Methods

put :: Freq -> Put #

get :: Get Freq #

putList :: [Freq] -> Put #

Freaky Freq Source # 
Instance details

Defined in Freq.Internal

Methods

prob :: Freq -> Word8 -> Word8 -> Double Source #

measure :: Freaky a => a -> ByteString -> Double Source #

Given a Frequency table and a ByteString, measure returns the probability that the ByteString is not randomised. The accuracy of measure is is heavily affected by your training data.

prob :: Freaky a => a -> Word8 -> Word8 -> Double Source #

Given a Frequency table and characters c1 and c2, what is the probability that c1 follows c2?

Pretty Printing

prettyFreqTrain :: FreqTrain -> IO () Source #

Pretty-print a FreqTrain.