krapsh-0.1.6.1: Haskell bindings for Spark Dataframes and Datasets

Safe Haskell: None
Language: Haskell2010

Spark.Core.Internal.DatasetStructures

Documentation

data ComputeNode loc a Source #

(internal) The main data structure that represents a data node in the computation graph.

This data structure forms the backbone of computation graphs expressed with spark operations.

loc is a typed locality tag. a is the type of the data, as seen by the Haskell compiler. If the type is erased, it will be a Cell type. (A simplified sketch of how these phantom parameters are used follows the instance list below.)

Constructors

ComputeNode 

Fields

  • _cnNodeId :: NodeId

    The id of the node.

Non-strict because it may be expensive to compute.

  • _cnOp :: !NodeOp

    The operation associated to this node.

  • _cnType :: !DataType

The type of the node.

  • _cnParents :: !(Vector UntypedNode)

    The direct parents of the node. The order of the parents is important for the semantics of the operation.

  • _cnLogicalDeps :: !(Vector UntypedNode)

    A set of extra dependencies that can be added to force an order between the nodes.

The order is not important; they are sorted by ID.

    TODO(kps) add this one to the id

  • _cnLocality :: !Locality

    The locality of this node.

    TODO(kps) add this one to the id

  • _cnName :: !(Maybe NodeName)

The name of the node, if any.

  • _cnLogicalParents :: !(Maybe (Vector UntypedNode))

    A set of nodes considered as the logical input for this node. This has no influence on the calculation of the id and is used for organization purposes only.

  • _cnPath :: NodePath

The path of this node in a computation flow.

This path includes the node name. Non-strict because it may be expensive to compute. By default it only contains the name of the node (i.e. the node is attached to the root).

Instances

Eq (ComputeNode loc a) Source # 

Methods

(==) :: ComputeNode loc a -> ComputeNode loc a -> Bool #

(/=) :: ComputeNode loc a -> ComputeNode loc a -> Bool #

CanRename (ComputeNode loc a) String Source # 

Methods

(@@) :: ComputeNode loc a -> String -> ComputeNode loc a Source #
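The two type parameters are phantom: they carry no runtime data and only constrain how nodes can be combined. The sketch below is a minimal, self-contained illustration of that pattern; the Node record, the locality tags and the aggregate function are hypothetical stand-ins, not the definitions used by this module:

    -- Stand-in locality tags (hypothetical, for illustration only).
    data LocalTag        -- data that lives on the driver
    data DistributedTag  -- data that lives on the cluster

    -- A stripped-down analogue of ComputeNode: loc and a are phantom
    -- parameters, so they exist only at the type level.
    data Node loc a = Node { nodeLabel :: String }

    -- Aggregating a distributed node yields a local node; the phantom tags
    -- let the compiler reject code that mixes the two localities.
    aggregate :: Node DistributedTag a -> Node LocalTag a
    aggregate (Node lbl) = Node ("aggregate(" ++ lbl ++ ")")

In the real module, the synonyms Dataset and LocalData defined below play the role of Node DistributedTag and Node LocalTag.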

data TypedLocality loc Source #

Constructors

TypedLocality 

Instances

type Dataset a = ComputeNode LocDistributed a Source #

A typed collection of distributed data.

Most operations on datasets are type-checked by the Haskell compiler: the type tag associated with this dataset is guaranteed to be convertible to a proper Haskell type. In particular, building a Dataset of dynamic cells is guaranteed never to happen.

If you want to do untyped operations and gain some flexibility, consider using UDataFrames instead.

Computations with Datasets and observables are generally checked for correctness using the type system of Haskell.

type LocalData a = ComputeNode LocLocal a Source #

A unit of data that can be accessed by the user.

This is a typed unit of data. The type is guaranteed to be a proper type accessible by the Haskell compiler (instead of simply a Cell type, which represents types only accessible at runtime).

TODO(kps) rename to Observable
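Because Dataset and LocalData are only synonyms for ComputeNode with a fixed locality tag, the locality of a value is visible directly in signatures. The combinators below are hypothetical and shown only to illustrate how such signatures read; they are not claimed to be part of this package's API:

    -- Hypothetical combinators (illustrative only; bodies elided).

    -- An aggregation collapses a distributed dataset into a driver-side observable.
    rowCount :: Dataset a -> LocalData Int
    rowCount = error "illustrative only"

    -- A transformation keeps its result distributed.
    mapCells :: (a -> b) -> Dataset a -> Dataset b
    mapCells = error "illustrative only"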

type DataFrame = Try (Dataset Cell) Source #

The dataframe type. Any dataset can be converted to a dataframe.

For Spark users: this is different from the definition of a dataframe in Spark, which is a dataset of rows. Because support for single columns is more awkward with rows, it is more natural to generalize datasets to contain cells. When communicating with Spark, though, single cells are wrapped into rows with a single field, as Spark does.
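Because the Dataset is wrapped in Try, an error produced while building the computation graph travels with the value instead of raising immediately. A minimal self-contained sketch of this shape, using stand-in types rather than the library's Try, Dataset and Cell:

    -- Stand-ins for illustration: a Try-like wrapper and a dynamic cell.
    type MyTry a  = Either String a
    data MyCell   = MyCell            -- a dynamically typed piece of data
    type MyFrame  = MyTry [MyCell]    -- analogue of Try (Dataset Cell)

    -- Graph-construction errors stay inside the wrapper until inspected.
    describeFrame :: MyFrame -> String
    describeFrame (Left err) = "invalid frame: " ++ err
    describeFrame (Right _)  = "a well-formed, untyped collection of cells"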

type LocalFrame = Try (LocalData Cell) Source #

An observable whose type can only be inferred at runtime and whose computation may fail at runtime.

Any observable can be converted to an untyped observable.

Untyped observables are more flexible and can be combined in an arbitrary manner, but invalid combinations will only fail during the validation of the Spark computation graph.

TODO(kps) rename to DynObservable
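The trade-off can be illustrated with stand-in types: combining untyped observables always type-checks in Haskell, and a mismatch between their runtime data types is only reported when the graph is validated. The names below are illustrative stand-ins, not this package's API:

    -- Stand-ins: an untyped observable only knows its data type as a value.
    data MyDataType   = MyIntType | MyStringType deriving (Eq, Show)
    data MyObservable = MyObservable { obsDataType :: MyDataType }
    type MyLocalFrame = Either String MyObservable  -- analogue of Try (LocalData Cell)

    -- Combining two untyped observables always compiles; the type check
    -- happens here, at validation time, and failures land in the Left case.
    addObservables :: MyObservable -> MyObservable -> MyLocalFrame
    addObservables x y
      | obsDataType x == MyIntType && obsDataType y == MyIntType =
          Right (MyObservable MyIntType)
      | otherwise =
          Left ("cannot add " ++ show (obsDataType x) ++ " and " ++ show (obsDataType y))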

data NodeEdge Source #

The different kinds of edges in the compute DAG of nodes, at the start of computations.

  • scope edges specify the scope of a node for naming. They are not included in the id.

data StructureEdge Source #

The edges in a compute DAG, after name resolution (which is where most of the checks and computations are performed); a rough sketch of the distinction between the two kinds of edges follows the constructor list below.

  • parent edges are the direct parents of a node, the only ones required for defining computations. They are included in the id.
  • logical edges define logical dependencies between nodes to force a specific ordering of the nodes. They are included in the id.

Constructors

ParentEdge 
LogicalEdge
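As a rough illustration of the two kinds of edges, the sketch below models a resolved node with ordered parents and unordered logical dependencies, and derives a toy identity from them (logical dependencies are normalized by sorting, matching the note above that their order does not matter). This is a simplified model, not the library's internal representation:

    import Data.List (sort)

    -- Mirrors the two constructors above, for illustration.
    data MyEdgeKind = MyParentEdge | MyLogicalEdge deriving (Eq, Show)

    -- Simplified model of a resolved node and its incoming edges.
    data MyNode = MyNode
      { myOpName      :: String
      , myParents     :: [String]  -- ordered: position carries meaning
      , myLogicalDeps :: [String]  -- unordered: only forces scheduling
      } deriving Show

    -- A toy identity: the operation, the parents in order, and the logical
    -- dependencies sorted so that their order cannot influence the result.
    myNodeKey :: MyNode -> String
    myNodeKey n =
      myOpName n ++ "(" ++ unwords (myParents n) ++ ")"
                 ++ "[" ++ unwords (sort (myLogicalDeps n)) ++ "]"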