krapsh-0.1.6.1: Haskell bindings for Spark Dataframes and Datasets

Safe Haskell: None
Language: Haskell2010

Spark.Core.Dataset

Description

This module describes the core data types (Dataset, DataFrame, Observable and DynObservable) and some basic operations that relate them.

Common data structures

data ComputeNode loc a Source #

(internal) The main data structure that represents a data node in the computation graph.

This data structure forms the backbone of computation graphs expressed with spark operations.

loc is a typed locality tag. a is the type of the data, as seen by the Haskell compiler. If erased, it falls back to the Cell type.

Instances

Eq (ComputeNode loc a) Source # 

Methods

(==) :: ComputeNode loc a -> ComputeNode loc a -> Bool #

(/=) :: ComputeNode loc a -> ComputeNode loc a -> Bool #

CanRename (ComputeNode loc a) String Source # 

Methods

(@@) :: ComputeNode loc a -> String -> ComputeNode loc a Source #
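
For instance, @@ assigns a stable name to a node as it is built. A minimal sketch, assuming a dataset constructor such as the one in Spark.Core.Functions (not part of this module):

    import Spark.Core.Dataset
    import Spark.Core.Functions (dataset)

    -- `dataset` is assumed from Spark.Core.Functions.
    -- Builds a distributed node and names it "my_input" in the graph.
    ds :: Dataset Int
    ds = dataset [1, 2, 3] @@ "my_input"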

Distributed data structures

type Dataset a = ComputeNode LocDistributed a Source #

A typed collection of distributed data.

Most operations on datasets are type-checked by the Haskell compiler: the type tag associated with this dataset is guaranteed to be convertible to a proper Haskell type. In particular, a Dataset of dynamic cells can never be built.

If you want to do untyped operations and gain some flexibility, consider using DataFrames instead.

Computations with Datasets and observables are generally checked for correctness by Haskell's type system.
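
A minimal sketch of this checking, again assuming a dataset constructor as in Spark.Core.Functions:

    import Spark.Core.Dataset
    import Spark.Core.Functions (dataset)

    -- The element type is tracked by the compiler.
    ints :: Dataset Int
    ints = dataset [1, 2, 3, 4]

    -- Rejected at compile time: the elements are not Ints.
    -- bad :: Dataset Int
    -- bad = dataset ["a", "b"]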

type DataFrame = Try (Dataset Cell) Source #

The dataframe type. Any dataset can be converted to a dataframe.

For Spark users: this is different from the definition of a dataframe in Spark, which is a dataset of rows. Because support for single columns is more awkward in the case of rows, it is more natural to generalize datasets to contain cells. When communicating with Spark, though, single cells are wrapped into rows with a single field, as Spark does.
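
Since a DataFrame carries a possible failure, operations on the underlying dataset are typically lifted over the Try layer. A sketch, assuming (as elsewhere in this package) that Try is a functor in the style of Either:

    import Spark.Core.Dataset

    -- Renames the underlying dataset when the dataframe is valid,
    -- and propagates the failure otherwise.
    renameDF :: DataFrame -> DataFrame
    renameDF = fmap (@@ "renamed")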

Local data structures

type LocalData a = ComputeNode LocLocal a Source #

A unit of data that can be accessed by the user.

This is a typed unit of data. The type is guaranteed to be a proper type accessible to the Haskell compiler (instead of simply a Cell, whose type is only known at runtime).

TODO(kps) rename to Observable
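
For example, aggregations produce local data out of distributed data. A minimal sketch, assuming dataset and count from Spark.Core.Functions (not part of this module):

    import Spark.Core.Dataset
    import Spark.Core.Functions (count, dataset)

    -- `count` and `dataset` are assumed from Spark.Core.Functions.
    -- count turns a distributed node into a local, observable one.
    total :: LocalData Int
    total = count (dataset [1, 2, 3 :: Int])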

type LocalFrame = Try (LocalData Cell) Source #

An observable whose type can only be inferred at runtime, and whose computation can fail at runtime.

Any observable can be converted to an untyped observable.

Untyped observables are more flexible and can be combined in arbitrary ways, but incorrect combinations will only fail during the validation of the Spark computation graph.

TODO(kps) rename to DynObservable

Conversions

asDF :: ComputeNode LocDistributed a -> DataFrame Source #

Converts to a dataframe and drops the type info. This always works.

asDS :: forall a. SQLTypeable a => DataFrame -> Try (Dataset a) Source #

Attempts to convert a dataframe into a (typed) dataset.

This will fail if the dataframe itself is a failure, or if the casting operation is not correct. This operation assumes that both field names and types are correct.
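
A sketch of the round trip through the untyped representation, assuming a dataset constructor as in Spark.Core.Functions and the Try alias in scope:

    import Spark.Core.Dataset
    import Spark.Core.Functions (dataset)

    ds :: Dataset Int
    ds = dataset [1, 2, 3]

    -- Dropping the type information always succeeds.
    df :: DataFrame
    df = asDF ds

    -- Recovering it is only an attempt, and may be a failure.
    ds' :: Try (Dataset Int)
    ds' = asDS df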

asLocalObservable :: ComputeNode LocLocal a -> LocalFrame Source #

Converts a local node to a local frame. This always works.
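
A sketch, assuming count and dataset from Spark.Core.Functions:

    import Spark.Core.Dataset
    import Spark.Core.Functions (count, dataset)

    -- `count` and `dataset` are assumed from Spark.Core.Functions.
    total :: LocalData Int
    total = count (dataset [1, 2, 3 :: Int])

    -- Erasing the static type of an observable never fails.
    totalDyn :: LocalFrame
    totalDyn = asLocalObservable total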

Relations

parents :: ComputeNode loc a -> [UntypedNode] -> ComputeNode loc a Source #

Adds parents to the node. It is assumed that the parents are the exact set of nodes required by the operation defined in this node. If you want to set parents for the sake of organizing the computation, use logicalParents. If you want to add timing dependencies between nodes, use depends.
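
A sketch with hypothetical input nodes a and b, built with a dataset constructor as in Spark.Core.Functions:

    import Spark.Core.Dataset
    import Spark.Core.Functions (dataset)

    a, b :: Dataset Int
    a = dataset [1, 2]
    b = dataset [3, 4]

    -- Declares a and b as the set of inputs consumed by the
    -- operation inside the given node.
    withInputs :: Dataset Int -> Dataset Int
    withInputs node = node `parents` [untyped a, untyped b]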

untyped :: ComputeNode loc a -> UntypedNode Source #

Converts any node to an untyped node.

depends :: ComputeNode loc a -> [UntypedNode] -> ComputeNode loc a Source #

Sets the logical dependencies on this node.

All the nodes given will be guaranteed to be executed before the current node.

If any of these dependencies is a failure, this node will also be treated as a failure (even if its direct parents are all successes).
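
A sketch of a helper that sequences one node after another:

    import Spark.Core.Dataset

    -- Guarantees that the setup node executes before the given node,
    -- even though the node does not consume its output.
    after :: ComputeNode loc a -> UntypedNode -> ComputeNode loc a
    after node setup = node `depends` [setup]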

logicalParents :: ComputeNode loc a -> [UntypedNode] -> ComputeNode loc a Source #

Establishes a naming convention on this node: the path of this node will be determined as if its parents were the list provided, without any effect from its actual direct parents.

For this to work, the logical parents must separate this node's internal nodes from the rest of the graph: for any ancestor of this node, every valid path from this node to that ancestor must pass through at least one of the logical parents.

This set may be a superset of the actual logical parents.

The check is lazy (done during the analysis phase), so an error, if any, will only be reported during analysis.
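
A sketch: a helper builds its output from its input through internal nodes (a hypothetical transformation f), but should be displayed as a single step fed only by the input:

    import Spark.Core.Dataset

    -- Paths under the result are computed as if input were its only
    -- parent, hiding the internal nodes created by f (hypothetical).
    asOneStep :: (Dataset Int -> Dataset Int) -> Dataset Int -> Dataset Int
    asOneStep f input = f input `logicalParents` [untyped input]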

Attributes

nodeLogicalParents :: ComputeNode loc a -> Maybe [UntypedNode] Source #

(developer) Returns the logical parents of a node.

nodeLogicalDependencies :: ComputeNode loc a -> [UntypedNode] Source #

Returns the logical dependencies of a node.

nodeParents :: ComputeNode loc a -> [UntypedNode] Source #

The nodes this node depends on directly.

nodeOp :: ComputeNode loc a -> NodeOp Source #

(developer) The operation performed by this node.

nodeName :: ComputeNode loc a -> NodeName Source #

The name of a node. TODO: should be a NodePath

nodeType :: ComputeNode loc a -> SQLType a Source #

The type of the node. TODO: have nodeType' for dynamic types as well.
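
A small inspection helper over these attributes, assuming Show instances for NodeName and SQLType:

    import Spark.Core.Dataset

    -- Renders a node's name and static type, for debugging.
    -- (Show instances for NodeName and SQLType are assumed.)
    describeNode :: ComputeNode loc a -> String
    describeNode n = show (nodeName n) ++ " : " ++ show (nodeType n)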