User guide
|
Discord
DataFrame
Tabular data analysis in Haskell. Read CSV, Parquet, and JSON files, transform columns with a typed expression DSL, and optionally lock down your entire schema at the type level for compile-time safety.
The library ships three API layers — all operating on the same underlying DataFrame type at runtime:
- Untyped (
import qualified DataFrame as D) — string-based column names, great for exploration and scripting.
- Typed (
import qualified DataFrame.Typed as T) — phantom-type schema tracking with compile-time column validation.
- Monadic API — write your transformation as a self contained pipeline.
This README is a runnable scripths notebook. Every Haskell block runs top-to-bottom in one shared session against the datasets in ./data. Reproduce every output below with scripths docs/base_scripts/base_readme.md -o README.md from the repo root.
Why this library?
- Concise, declarative, composable data pipelines using the
|> pipe operator.
- Choose your level of type safety: keep it lightweight for quick analysis, or lock it down for production pipelines.
- High performance from Haskell's optimizing compiler and an efficient columnar memory model with bitmap-backed nullability.
- Designed for interactivity: a custom REPL, IHaskell notebook support, terminal and web plotting, and helpful error messages.
Install
cabal update
cabal install dataframe
To use as a dependency in a project:
build-depends: base >= 4, dataframe
Works with GHC 9.4 through 9.12. A custom REPL with all imports pre-loaded is available after installing:
dataframe
Quick Start
Group sales by product and compute totals. The first block carries the
scripths cabal directives and the imports shared by the rest of the document;
you can also drop the same code into an Example.hs and run it with
cabal run Example.hs after adding a #!/usr/bin/env cabal header.
-- cabal: build-depends: dataframe, text
-- cabal: default-extensions: OverloadedStrings, TypeApplications, TemplateHaskell, DataKinds, TypeFamilies, FlexibleInstances, FlexibleContexts, ScopedTypeVariables, DeriveGeneric, UndecidableInstances
import qualified DataFrame as D
import qualified DataFrame.Functions as F
import qualified DataFrame.Typed as DT
import DataFrame.Operators
import Data.Text (Text)
import Data.Int (Int64)
sales = D.fromNamedColumns
[ ("product", D.fromList [1, 1, 2, 2, 3, 3 :: Int])
, ("amount", D.fromList [100, 120, 50, 20, 40, 30 :: Int])
]
-- Group by product and compute totals
sales
|> D.groupBy ["product"]
|> D.aggregate [ F.sum (F.col @Int "amount") `as` "total"
, F.count (F.col @Int "amount") `as` "orders"
]
|> D.toMarkdown'
product Int |
total Int |
orders Int |
| 1 |
220 |
2 |
| 3 |
70 |
2 |
| 2 |
70 |
2 |
Reading from files works the same way:
fileDf <- D.readCsv "./data/housing.csv"
fileDf <- D.readParquet "./data/mtcars.parquet"
-- Hugging Face datasets (needs network access):
-- fileDf <- D.readParquet "hf://datasets/scikit-learn/iris/default/train/0000.parquet"
D.dimensions fileDf
(32,12)
Interactive REPL
The dataframe REPL comes with all imports pre-loaded. Here's a typical exploration session (each block runs as a cell):
df <- D.readCsv "./data/housing.csv"
D.dimensions df
(20640,10)
D.describeColumns df |> D.toMarkdown'
Column Name Text |
# Non-null Values Int |
# Null Values Int |
Type Text |
| total_bedrooms |
20433 |
207 |
Maybe Double |
| ocean_proximity |
20640 |
0 |
Text |
| median_house_value |
20640 |
0 |
Double |
| median_income |
20640 |
0 |
Double |
| households |
20640 |
0 |
Double |
| population |
20640 |
0 |
Double |
| total_rooms |
20640 |
0 |
Double |
| housing_median_age |
20640 |
0 |
Double |
| latitude |
20640 |
0 |
Double |
| longitude |
20640 |
0 |
Double |
The :declareColumns macro ($(D.declareColumns df) outside the REPL) generates typed column references from a dataframe, so you can use column names directly in expressions instead of writing F.col @Double "median_income" every time:
$(D.declareColumns df)
df |> D.groupBy ["ocean_proximity"]
|> D.aggregate [F.mean median_house_value `as` "avg_value"]
|> D.toMarkdown'
ocean_proximity Text |
avg_value Double |
| NEAR BAY |
259212.31179039303 |
| NEAR OCEAN |
249433.97742663656 |
| INLAND |
124805.39200122119 |
| <1H OCEAN |
240084.28546409807 |
| ISLAND |
380440.0 |
Create new columns from existing ones:
df |> D.derive "rooms_per_household" (total_rooms / households) |> D.take 3 |> D.toMarkdown'
longitude Double |
latitude Double |
housing_median_age Double |
total_rooms Double |
total_bedrooms Maybe Double |
population Double |
households Double |
median_income Double |
median_house_value Double |
ocean_proximity Text |
rooms_per_household Double |
| -122.23 |
37.88 |
41.0 |
880.0 |
Just 129.0 |
322.0 |
126.0 |
8.3252 |
452600.0 |
NEAR BAY |
6.984126984126984 |
| -122.22 |
37.86 |
21.0 |
7099.0 |
Just 1106.0 |
2401.0 |
1138.0 |
8.3014 |
358500.0 |
NEAR BAY |
6.238137082601054 |
| -122.24 |
37.85 |
52.0 |
1467.0 |
Just 190.0 |
496.0 |
177.0 |
7.2574 |
352100.0 |
NEAR BAY |
8.288135593220339 |
Type mismatches are caught as compile errors — adding a Double column to a Text column won't silently produce garbage:
dataframe> df |> D.derive "nonsense" (latitude + ocean_proximity)
<interactive>:14:47: error: [GHC-83865]
• Couldn't match type 'Text' with 'Double'
Expected: Expr Double
Actual: Expr Text
• In the second argument of '(+)', namely 'ocean_proximity'
In the second argument of 'derive', namely
'(latitude + ocean_proximity)'
Template Haskell
For scripts and projects, Template Haskell can generate column bindings at compile time.
Generate column references from a CSV
declareColumnsFromCsvFile (in DataFrame.TH, also re-exported from DataFrame)
reads your CSV at compile time and generates typed Expr bindings for every column:
-- Reads housing.csv at compile time and generates:
-- latitude :: Expr Double
-- total_rooms :: Expr Double
-- ocean_proximity :: Expr Text
-- ... one binding per column
$(D.declareColumnsFromCsvFile "./data/housing.csv")
df <- D.readCsv "./data/housing.csv"
df |> D.derive "rooms_per_household" (total_rooms / households)
|> D.filterWhere (median_income .>. 5)
|> D.groupBy ["ocean_proximity"]
|> D.aggregate [F.mean median_house_value `as` "avg_value"]
|> D.toMarkdown'
ocean_proximity Text |
avg_value Double |
| NEAR BAY |
361441.9354304636 |
| NEAR OCEAN |
380041.63071895426 |
| INLAND |
234817.86695906433 |
| <1H OCEAN |
333411.75125531096 |
Compare this to the manual version which requires spelling out every column name and type:
-- Without TH — every column needs its name and type spelled out
df |> D.derive "rooms_per_household"
(F.col @Double "total_rooms" / F.col @Double "households")
|> D.filterWhere (F.col @Double "median_income" .>. F.lit 5)
|> D.take 5
|> D.toMarkdown'
longitude Double |
latitude Double |
housing_median_age Double |
total_rooms Double |
total_bedrooms Maybe Double |
population Double |
households Double |
median_income Double |
median_house_value Double |
ocean_proximity Text |
rooms_per_household Double |
| -122.23 |
37.88 |
41.0 |
880.0 |
Just 129.0 |
322.0 |
126.0 |
8.3252 |
452600.0 |
NEAR BAY |
6.984126984126984 |
| -122.22 |
37.86 |
21.0 |
7099.0 |
Just 1106.0 |
2401.0 |
1138.0 |
8.3014 |
358500.0 |
NEAR BAY |
6.238137082601054 |
| -122.24 |
37.85 |
52.0 |
1467.0 |
Just 190.0 |
496.0 |
177.0 |
7.2574 |
352100.0 |
NEAR BAY |
8.288135593220339 |
| -122.25 |
37.85 |
52.0 |
1274.0 |
Just 235.0 |
558.0 |
219.0 |
5.6431000000000004 |
341300.0 |
NEAR BAY |
5.8173515981735155 |
| -122.29 |
37.82 |
49.0 |
135.0 |
Just 29.0 |
86.0 |
23.0 |
6.1183 |
75000.0 |
NEAR BAY |
5.869565217391305 |
Generate a schema type from a CSV
deriveSchemaFromCsvFile generates a type synonym for use with the typed API — instead of manually writing out every column name and type:
-- Generates:
-- type HousingSchema = '[ DT.Column "longitude" Double
-- , DT.Column "latitude" Double
-- , DT.Column "total_rooms" Double
-- , ...
-- ]
$(DT.deriveSchemaFromCsvFile "HousingSchema" "./data/housing.csv")
Generate a schema (and a row bridge) from a record ADT
When the canonical row shape lives in your code as a Haskell record,
deriveSchemaFromType produces both the typed schema and a HasSchema
instance that converts between [Order] and a DataFrame (or
TypedDataFrame OrderSchema) at runtime:
data Order = Order
{ orderId :: Int64
, region :: Text
, amount :: Double
} deriving (Show, Eq)
$(DT.deriveSchemaFromType ''Order)
-- expands to:
-- type OrderSchema =
-- '[DT.Column "order_id" Int64, DT.Column "region" Text, DT.Column "amount" Double]
-- instance DT.HasSchema Order where
-- type Schema Order = OrderSchema
-- toColumns = ...
-- fromColumns = ...
xs :: [Order]
xs = [Order 1 "us" 10.0, Order 2 "eu" 20.5]
-- Untyped: [Order] -> DataFrame
ordersDf :: D.DataFrame
ordersDf = D.fromRecords xs
ordersDf |> D.toMarkdown'
order_id Int64 |
region Text |
amount Double |
| 1 |
us |
10.0 |
| 2 |
eu |
20.5 |
The runtime-checked round-trip back to records:
D.toRecords ordersDf :: Either Text [Order]
Right [Order {orderId = 1, region = "us", amount = 10.0},Order {orderId = 2, region = "eu", amount = 20.5}]
And the typed bridge — [Order] to TypedDataFrame OrderSchema and back:
DT.thaw (DT.fromRecordsTyped xs :: DT.TypedDataFrame OrderSchema) |> D.toMarkdown'
order_id Int64 |
region Text |
amount Double |
| 1 |
us |
10.0 |
| 2 |
eu |
20.5 |
Field names are translated camelCase → snake_case by default; override
the translation with deriveSchemaFromTypeWith defaultSchemaOptions{nameTransform = id} (or any String -> String).
If all you need is a runtime Schema to drive readCsvWithSchema (no
typed-dataframe machinery), there's a companion splice in
DataFrame.Internal.Schema (re-exported from DataFrame):
$(D.deriveSchema ''Order)
-- emits:
-- orderSchema :: Schema
-- orderSchema = makeSchema [("order_id", schemaType @Int64), ...]
-- orderOrderId :: Expr Int64
-- orderOrderId = col "order_id"
-- orderRegion :: Expr Text
-- orderRegion = col "region"
-- orderAmount :: Expr Double
-- orderAmount = col "amount"
orders :: IO D.DataFrame
orders = do
raw <- D.readCsvWithSchema orderSchema "./data/orders.csv"
pure (D.filter orderAmount (> 100) raw)
Each record field gets a typed accessor named <lower-first TyConName><UpperFirst FieldName>,
so data Order { customerId :: Int } yields orderCustomerId :: Expr Int = col "customer_id".
That's the same shape as $(D.declareColumns df) produces from a runtime
DataFrame, but driven off the ADT instead of an existing frame.
If you'd rather not depend on Template Haskell, the same schema is
available via GHC.Generics (shown here on an equivalent record):
import GHC.Generics (Generic)
import DataFrame.Typed (Schema)
data OrderG = OrderG
{ orderGId :: Int64
, regionG :: Text
, amountG :: Double
} deriving (Generic)
type OrderGSchema = DT.SchemaOf OrderG
instance DT.HasSchema OrderG where
type Schema OrderG = OrderGSchema
toColumns = DT.genericToColumns
fromColumns = DT.genericFromColumns
Typed API
When you want compile-time guarantees that column names exist and types match, wrap your DataFrame in a TypedDataFrame:
type EmployeeSchema =
'[ DT.Column "name" Text
, DT.Column "department" Text
, DT.Column "salary" Double
]
employees <- D.readCsv "./data/employees.csv"
case DT.freeze @EmployeeSchema employees of
Nothing -> "Schema mismatch!"
Just tdf -> tdf
|> DT.derive @"bonus" (DT.col @"salary" * DT.lit 0.1)
|> DT.filterWhere (DT.col @"salary" DT..>. DT.lit 50000)
|> DT.select @'["name", "bonus"]
|> DT.thaw
|> D.toMarkdown'
name Text |
bonus Double |
| Alice |
8500.0 |
| Carol |
12000.0 |
| Dave |
5200.0 |
| Frank |
6700.0 |
DT.freeze validates the runtime DataFrame against your schema once at the boundary. After that, every column access is checked at compile time:
-- Typo in column name -> compile error
tdf |> DT.filterWhere (DT.col @"slary" DT..>. DT.lit 50000)
-- error: Column "slary" not found in schema
-- Wrong type -> compile error
tdf |> DT.filterWhere (DT.col @"name" DT..>. DT.lit 50000)
-- error: Couldn't match type 'Text' with 'Double'
filterAllJust goes further — it strips Maybe from every column in the schema type, so downstream code can't accidentally treat cleaned columns as nullable:
type ScoreSchema = '[ DT.Column "name" Text, DT.Column "score" (Maybe Double) ]
scoresDf = D.fromNamedColumns
[ ("name", D.fromList ["a", "b", "c" :: Text])
, ("score", D.fromList [Just 1.0, Nothing, Just 3.0 :: Maybe Double])
]
Just stdf = DT.freeze @ScoreSchema scoresDf
-- filterAllJust drops the null row and changes the column type from
-- (Maybe Double) to Double, so `scaled` can multiply it directly.
DT.thaw (DT.filterAllJust stdf |> DT.derive @"scaled" (DT.col @"score" * DT.lit 100)) |> D.toMarkdown'
name Text |
score Double |
scaled Double |
| a |
1.0 |
100.0 |
| c |
3.0 |
300.0 |
Features
I/O: CSV, TSV, Parquet (Snappy, ZSTD, Gzip), JSON. Read Parquet from HTTP URLs and Hugging Face datasets (hf:// URIs). Column projection and predicate pushdown for Parquet reads.
Operations: filter, select, derive, groupBy, aggregate, joins (inner, left, right, full outer), sort, sample, stratified sample, distinct, k-fold splits.
Expressions: typed column references (F.col @Double "x"), arithmetic, comparisons, logical operators, nullable-aware three-valued logic (.==, .&&), string matching (like, regex), casting, and user-defined functions via lift/lift2.
Statistics: mean, median, mode, variance, standard deviation, percentiles, inter-quartile range, correlation, skewness, frequency tables, imputation.
Plotting: terminal plots (histogram, scatter, line, bar, box, pie, heatmap, stacked bar, correlation matrix) and interactive HTML plots.
Lazy engine: streaming query execution for files that don't fit in memory. Rule-based optimizer with filter fusion, predicate pushdown, and dead column elimination. Pull-based executor with configurable batch sizes.
Interop: Arrow C Data Interface for zero-copy round-trips with Python and Polars.
ML: decision trees (TAO algorithm), feature synthesis, k-fold cross-validation, stratified sampling.
Notebooks: IHaskell integration with pre-built Binder examples.
Lazy Queries
For files too large to fit in memory, DataFrame.Lazy provides a streaming query engine. Declare a schema, build a query plan with the same familiar operations, and runDataFrame runs it through an optimizer before streaming results batch-by-batch:
import qualified DataFrame.Lazy as L
import DataFrame.Internal.Schema (schemaType, makeSchema)
housingSchema = makeSchema
[ ("longitude", schemaType @Double)
, ("latitude", schemaType @Double)
, ("housing_median_age", schemaType @Double)
, ("total_rooms", schemaType @Double)
, ("total_bedrooms", schemaType @(Maybe Double))
, ("population", schemaType @Double)
, ("households", schemaType @Double)
, ("median_income", schemaType @Double)
, ("median_house_value", schemaType @Double)
, ("ocean_proximity", schemaType @Text)
]
lazyResult <- L.runDataFrame $
L.scanCsv housingSchema "./data/housing.csv"
|> L.filter (F.col @Double "median_income" .>. F.lit 5)
|> L.derive "value_per_income"
(F.col @Double "median_house_value" / F.col @Double "median_income")
|> L.select ["ocean_proximity", "median_house_value", "value_per_income"]
|> L.take 1000
D.take 10 lazyResult |> D.toMarkdown'
ocean_proximity Text |
median_house_value Double |
value_per_income Double |
| NEAR BAY |
452600.0 |
54365.06029885168 |
| NEAR BAY |
358500.0 |
43185.48678536151 |
| NEAR BAY |
352100.0 |
48515.997464656764 |
| NEAR BAY |
341300.0 |
60480.94132657581 |
| NEAR BAY |
75000.0 |
12258.307046074891 |
| NEAR BAY |
262500.0 |
51554.49064163246 |
| NEAR BAY |
327600.0 |
55908.25312308007 |
| NEAR BAY |
347600.0 |
65748.65703260951 |
| NEAR BAY |
366100.0 |
61467.42780389524 |
| NEAR BAY |
373600.0 |
58895.860264211624 |
The optimizer pushes the filter into the scan, drops unreferenced columns before reading, and stops pulling batches once 1000 rows have been collected.
Documentation