cassava-records
A library extension for cassava
(Haskell CSV parser library) that
automatically creates a Record data type given an columnar input file. eg
a CSV file.
Say you are working on a project that involves processing a number of
comma separated or tab serparated files. Assuming, you are using
cassava for loading the input files, here is a typical workflow you
would follow
a. Inspect the file that contains Salaries
of Employees
.
b. Create a Record
data type called Salaries
to reflect
the columns and types found in the file
c. Create required instances of the Salaries
data type that may be
required to load the files with cassava
.
Now, imagine this file you are inspecting to contains tens or hundreds
of columns. Now, as a good Haskeller you will want to automate steps
(a) and (b) to the extend possible. That is precisely, what this
library does.
cassava-records
performs the following tasks. Given, a input file
(command or tab-seperated for example), it reads the whole file,
infers some basic data types for each column and automatically creates
a Record
of inferred data types using Template Haskell
.
Quick Start
Example 1 :
Using data/salaries_simple.csv
emp_no,name,salary,status,years
1,John Doe,100.0,True,1
2,Jill Doe, 200.10,False,2
3,John Doe Sr,101.0,T,3
4,Jill Doe Sr, 10101.10,f,4.2
5,John Doe Jr,1010101.0,true,5.1
6,Jill Doe Jr, 10101.10,false,6
{-#LANGUAGE TemplateHaskell#-}
{-#LANGUAGE DeriveGeneric #-}
import Data.Cassava.Records
import Data.Csv
import qualified Data.ByteString.Lazy as BL
import Data.Vector as V
import Data.Text as DT
$(makeCsvRecord "Salaries" "data/salaries_simple.csv" "_" commaOptions)
The makeCsvRecord
function take 4 arguments,
"Salaries"
: A String
that will be used as name for
the Record
.
data/salaries_simple.csv
: path to the input file
"_"
: string to prefix each field. Useful, if we need to build lens
for this record
commaOptions
: defaultDecodeOptions
defined
in cassava
library
If you load this code in GHCi
, we will see
1 >:info Salaries
data Salaries
= Salaries {
_emp_no:: Integer,
_name :: Text,
_salary :: Double,
_status :: Bool,
_years:: Double}
Note that all column names are in lower case and "_" has been prefixed
to the column names.
To be consistent, cassava-record
converts all column headers to
lower-case before created corresponding field names for each column
header. Therefore, if column headers were all upper-case, we need to
provide a field modifier while creating ToNamedRecord
and FromNamedRecord
instances for cassava
.
Note, that if the column headers are mixed case, it become
tricky. Current version of the library does not work very well with
mixed case column headers.
There is a convenience method called makeInstances
that can create
the instances required for cassava
.The instances created use the
default fieldModifieroptions
settings shown below.
fieldModifierOptions :: Options
fieldModifierOptions = defaultOptions { fieldLabelModifier = rmUnderscore }
where
rmUnderscore ('_':str) = DT.unpack . DT.toUpper . DT.pack $ str
rmUnderscore str = str
-- the ToNamedRecord and FromNamedRecord are needed by Cassava since
-- we prefix
instance ToNamedRecord Salaries where
toNamedRecord = genericToNamedRecord fieldModifierOptions
instance FromNamedRecord Salaries where
parseNamedRecord = genericParseNamedRecord fieldModifierOptions
main :: IO ()
main = do
v <- loadData "data/salaries_simple.csv":: IO (V.Vector Salaries)
putStrLn . show $ v
In GHCi
we see (formatted for clarity)
2 >loadData "data/salaries.csv"
[Salaries {_emp_no = 1, _name = "John Doe", _salary = 100.0, _status = False, _years = 1.0},
Salaries {_emp_no = 2, _name = "Jill Doe", _salary = 200.1, _status = False, _years = 2.0},
Salaries {_emp_no = 3, _name = "John Doe Sr", _salary = 101.0, _status = True, _years = 3.0},
Salaries {_emp_no = 4, _name = "Jill Doe Sr", _salary = 10101.1, _status = False, _years = 4.2},
Salaries {_emp_no = 5, _name = "John Doe Jr", _salary = 1010101.0, _status = False, _years = 5.1},
Salaries {_emp_no = 6, _name = "Jill Doe Jr", _salary = 10101.1, _status = False, _years = 6.0}]
Note, the type inference in the above example is as follows:
- If a column has values from the set {
true
, t
, false
, f
}
(ignoring case) then the inferred type is Bool
.
- If a column has values that are all numeric, then an
Integer
type is attempted, or else a Double
is infered. For example
for emp_no
the infered type is a Integer
whereas for years
the type is Double
.
- For all other cases, a
Text
type is inferred.
Example 2 (Missing Values)
The library also supports type inference when values are missing. For example in, data/salaries_mixed_input.csv
emp_no,name,salary,status,years
1,John Doe,100.0,True,1
2,Jill Doe, 200.10,False,2
3,John Doe Sr,101.0,T,
4,Jill Doe Sr,10101.10,,4.2
5,John Doe Jr,,true,5.1
6,, 10101.10,false,6
the status
for Jill Doe Jr is missing and the salary
for
John Doe Sr is missing. In this case, the type as wrapped in a Maybe
type.
In that case, the record instance we get will be as follows:
3 >:info Salaries
data Salaries
= Salaries {
_emp_no:: Integer,
_name :: Maybe Text,
_salary :: Maybe Double,
_status :: Maybe Bool,
_years:: Maybe Double}
Loading this data, would produce the following output
{-#LANGUAGE TemplateHaskell#-}
{-#LANGUAGE DeriveGeneric #-}
import Data.Cassava.Records
import Data.Csv
import qualified Data.ByteString.Lazy as BL
import Data.Vector as V
import Data.Text as DT
$(makeCsvRecord "SalariesMixed" "data/salaries_mixed_input.csv" "_" commaOptions)
$(makeInstance "SalariesMixed")
-- ^ note that we can use this function instead of manually defining
-- all instances required by Cassava
main :: IO ()
main = do
v <- loadData "data/salaries_mixed_input.csv":: IO (V.Vector SalariesMixed)
putStrLn . show $ v
The output will be as follows:
[SalariesMixed {_emp_no = 1, _name = Just "John Doe", _salary = Just 100.0, _status = Just False, _years = Just 1.0},
SalariesMixed {_emp_no = 2, _name = Just "Jill Doe", _salary = Just 200.1, _status = Just False, _years = Just 2.0},
SalariesMixed {_emp_no = 3, _name = Just "John Doe Sr", _salary = Just 101.0, _status = Just True, _years = Nothing},
SalariesMixed {_emp_no = 4, _name = Just "Jill Doe Sr", _salary = Just 10101.1, _status = Nothing, _years = Just 4.2},
SalariesMixed {_emp_no = 5, _name = Just "John Doe Jr", _salary = Nothing, _status = Just False, _years = Just 5.1},
SalariesMixed {_emp_no = 6, _name = Nothing, _salary = Just 10101.1, _status = Just False, _years = Just 6.0}]
Here is a full working code that uses both the examples:
{-#LANGUAGE TemplateHaskell#-}
{-#LANGUAGE DeriveGeneric #-}
{-#LANGUAGE ScopedTypeVariables #-}
{-#LANGUAGE DuplicateRecordFields #-}
{-# LANGUAGE DeriveDataTypeable #-}
module Main where
import Data.Cassava.Records
import Data.Csv
import qualified Data.ByteString.Lazy as BL
import Data.Vector as V
import Data.Text as DT
import qualified Text.PrettyPrint.Tabulate as T
import Language.Haskell.TH
-- import Control.Lens hiding (element)
$(makeCsvRecord "Salaries" "data/salaries_simple.csv" "_" commaOptions)
-- $(makeInstance "Salaries")
$(makeCsvRecord "SalariesMixed" "data/salaries_mixed_input.csv" "_" commaOptions)
$(makeInstance "SalariesMixed")
-- the following instance is not required, if $(makeInstance Salaries) statement
-- is spliced in (currently commented in the example)
myOptions :: Options
myOptions = defaultOptions { fieldLabelModifier = rmUnderscore }
where
rmUnderscore ('_':str) = DT.unpack . DT.toUpper . DT.pack $ str
rmUnderscore str = str
instance ToNamedRecord Salaries where
toNamedRecord = genericToNamedRecord myOptions
instance FromNamedRecord Salaries where
parseNamedRecord = genericParseNamedRecord myOptions
main :: IO ()
main = do
v <- loadData "data/salaries_simple.csv" :: IO (V.Vector Salaries)
v1 <- loadData "data/salaries_mixed_input.csv" :: IO (V.Vector SalariesMixed)
putStrLn . show $ v
putStrLn . show $ v1
Caveats (Or list of future enhancements)
- The columns names along with prefix("_") should be a valid Haskell field
names. For example, column names cannot have spaces or other
characters not supported for use as
field
names of Record
data type.
- The library loads the whole file during compilation to infer
types. Given the size of the file, this will increase the compile
time. Alternative workflows, like stripping the file or dumping the
created slice into a file is recommended. In the future, the
makeCsvRecord function can take a parameter to specify the max
number of rows that can be used to infer the types.
- The inferred types are limited. Text, Bool, Integer, Double and the MayBe
variants of those. Future, support may include DateTime.
- Mixed case column headers not automatically supported. A more
complex form of
fieldOptionModifiers
needs to be provided.
- Currently there is no option to provide custom types.