krapsh-0.1.6.1: Haskell bindings for Spark Dataframes and Datasets

Safe Haskell: None
Language: Haskell2010

Spark.Core.Internal.DatasetFunctions


Documentation

parents :: ComputeNode loc a -> [UntypedNode] -> ComputeNode loc a Source #

Adds parents to the node. It is assumed the parents are the unique set of nodes required by the operation defined in this node. If you want to set parents for the sake of organizing computation use logicalParents. If you want to add some timing dependencies between nodes, use depends.
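A minimal sketch of attaching parents, assuming `n`, `a` and `b` are nodes defined elsewhere and that the operation in `n` reads from both `a` and `b`:

```haskell
-- Sketch only: n, a and b are assumed to be defined elsewhere.
-- The operation carried by n is assumed to consume both a and b.
n' :: Dataset Int
n' = n `parents` [untyped a, untyped b]
```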

untyped :: ComputeNode loc a -> UntypedNode Source #

Converts any node to an untyped node

logicalParents :: ComputeNode loc a -> [UntypedNode] -> ComputeNode loc a Source #

Establishes a naming convention on this node: the path of this node will be determined as if the parents of this node were the list provided (and without any effect from the direct parents of this node).

For this to work, the provided list should partition the ancestors of this node into internal nodes, logical parents, and the rest. In other words, for any ancestor of this node, every valid path reaching that ancestor should include at least one node from the provided list.

This set can be a super set of the actual logical parents.

The check is lazy (performed during the analysis phase), so any error will only be reported during analysis.
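A hypothetical sketch, assuming `result` is built from `input` through some intermediate nodes (all defined elsewhere). Declaring `input` as the only logical parent names `result`'s path as if it derived directly from `input`:

```haskell
-- Sketch: result, input and the intermediate nodes between them
-- are assumed to be defined elsewhere.
named :: Dataset Int
named = result `logicalParents` [untyped input]
```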

depends :: ComputeNode loc a -> [UntypedNode] -> ComputeNode loc a Source #

Sets the logical dependencies on this node.

All the nodes given will be guaranteed to be executed before the current node.

If any of these nodes fails, this node will also be treated as a failure (even if its direct parents all succeed).
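A minimal sketch, assuming `ds1`, `ds2` and `ds3` are datasets defined elsewhere: `ds1` and `ds2` are guaranteed to run before `ds3`, without becoming data parents of `ds3`.

```haskell
-- Sketch only: ds1, ds2 and ds3 are assumed to be defined elsewhere.
ds3' :: Dataset Int
ds3' = ds3 `depends` [untyped ds1, untyped ds2]
```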

dataframe :: DataType -> [Cell] -> DataFrame Source #

Creates a dataframe from a list of cells and a datatype.

Will fail if the content of the cells is not compatible with the given data type.
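A hedged sketch of constructing a dataframe; `intType` and `intCell` below are hypothetical helpers standing in for the library's actual `DataType` and `Cell` constructors:

```haskell
-- Sketch: intType and intCell are hypothetical stand-ins for the
-- library's DataType and Cell constructors.
df :: DataFrame
df = dataframe intType [intCell 1, intCell 2, intCell 3]
```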

asDF :: ComputeNode LocDistributed a -> DataFrame Source #

Converts to a dataframe and drops the type info. This always works.

asDS :: forall a. SQLTypeable a => DataFrame -> Try (Dataset a) Source #

Attempts to convert a dataframe into a (typed) dataset.

This will fail if the dataframe itself is a failure, or if the cast operation is not valid. This operation assumes that both the field names and the types are correct.
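A sketch of the round trip, assuming `ds :: Dataset Int` is defined elsewhere — dropping the static type and then attempting to recover it:

```haskell
-- Sketch: ds :: Dataset Int is assumed to be defined elsewhere.
df :: DataFrame
df = asDF ds

ds' :: Try (Dataset Int)
ds' = asDS df  -- succeeds when field names and types line up
```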

asLocalObservable :: ComputeNode LocLocal a -> LocalFrame Source #

Converts a local node to a local frame. This always works.

identity :: ComputeNode loc a -> ComputeNode loc a Source #

The identity function.

Returns a compute node with the same datatype and the same content as the previous node. If the operation of the input has a side effect, this side effect is *not* reevaluated.

This operation is typically used when establishing an ordering between some operations such as caching or side effects, along with logicalDependencies.
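A minimal sketch of such an ordering, assuming `ds` and `effectNode` are defined elsewhere: the identity node carries the content of `ds` unchanged, but lets us attach an ordering constraint without re-running the side effects of `ds`.

```haskell
-- Sketch only: ds and effectNode are assumed to be defined elsewhere.
anchored :: Dataset Int
anchored = identity ds `depends` [untyped effectNode]
```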

autocache :: Dataset a -> Dataset a Source #

Automatically caches the dataset on an as-needed basis, and deallocates the cache when the dataset is no longer required.

This function marks a dataset as eligible for the default caching level in Spark. The current implementation performs caching only if it can be established that the dataset is going to be involved in more than one shuffling or aggregation operation.

If the dataset has no observable child, no uncaching operation is added: the autocache operation is equivalent to unconditional caching.
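Usage is a one-liner; in this sketch `ds` is assumed to be defined elsewhere, and Krapsh decides during analysis whether a cache/uncache pair is actually inserted:

```haskell
-- Sketch: ds is assumed to be defined elsewhere.
auto :: Dataset Int
auto = autocache ds
```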

cache :: Dataset a -> Dataset a Source #

Caches the dataset.

This function instructs Spark to cache a dataset with the default persistence level in Spark (MEMORY_AND_DISK).

Note that the dataset will have to be evaluated first for the caching to take effect, so it is usual to call count or other aggregators to force the caching to occur.
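A hedged sketch, assuming `ds` is defined elsewhere; evaluating an aggregate over the cached dataset is what forces the caching to take effect:

```haskell
-- Sketch: ds is assumed to be defined elsewhere. An aggregator such
-- as count (not shown here) would then force the cache to populate.
cached :: Dataset Int
cached = cache ds
```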

uncache :: ComputeNode loc a -> ComputeNode loc a Source #

Uncaches the dataset.

This function instructs Spark to unmark the dataset as cached, so that the disk and memory it occupies can be reclaimed by Spark in the future.

Unlike Spark, Krapsh is stricter with the uncaching operation:

- the argument of uncache must be a cached dataset;
- once a dataset is uncached, its cached version cannot be used again (i.e. it must be recomputed).

Krapsh performs escape analysis and will refuse to run programs with caching issues.
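A sketch of the required pairing, assuming `ds` is defined elsewhere; `cached` must not be referenced again after the uncache:

```haskell
-- Sketch: ds is assumed to be defined elsewhere. cache and uncache
-- must be paired, and cached may not be used after this point.
cached :: Dataset Int
cached = cache ds

released :: Dataset Int
released = uncache cached
```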

union :: Dataset a -> Dataset a -> Dataset a Source #

Returns the union of two datasets.

In the context of streaming and differentiation, this union is biased towards the left: the left argument expresses the stream and the right element expresses the increment.
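A minimal sketch, assuming `base` (the stream) and `delta` (the increment) are defined elsewhere with the same element type:

```haskell
-- Sketch only: base and delta are assumed to be defined elsewhere.
combined :: Dataset Int
combined = base `union` delta
```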

castLocality :: forall a loc loc'. (CheckedLocalityCast loc, CheckedLocalityCast loc') => ComputeNode loc a -> Try (ComputeNode loc' a) Source #

emptyNodeStandard :: forall loc a. TypedLocality loc -> SQLType a -> Text -> ComputeNode loc a Source #

nodeLogicalDependencies :: ComputeNode loc a -> [UntypedNode] Source #

Returns the logical dependencies of a node.

nodeLogicalParents :: ComputeNode loc a -> Maybe [UntypedNode] Source #

(developer) Returns the logical parents of a node.

nodeName :: ComputeNode loc a -> NodeName Source #

The name of a node. TODO: should be a NodePath

nodePath :: ComputeNode loc a -> NodePath Source #

The path of a node, as resolved.

This path includes information about the logical parents (after resolution).

nodeOp :: ComputeNode loc a -> NodeOp Source #

(developer) The operation performed by this node.

nodeParents :: ComputeNode loc a -> [UntypedNode] Source #

The nodes this node depends on directly.

nodeType :: ComputeNode loc a -> SQLType a Source #

The type of the node. TODO: have nodeType' for dynamic types as well.

updateNode :: ComputeNode loc a -> (ComputeNode loc a -> ComputeNode loc' b) -> ComputeNode loc' b Source #

fun1ToOpTyped :: forall a loc a' loc'. IsLocality loc => SQLType a -> (ComputeNode loc a -> ComputeNode loc' a') -> NodeOp Source #

(internal) conversion

fun2ToOpTyped :: forall a1 a2 a loc1 loc2 loc. (IsLocality loc1, IsLocality loc2) => SQLType a1 -> SQLType a2 -> (ComputeNode loc1 a1 -> ComputeNode loc2 a2 -> ComputeNode loc a) -> NodeOp Source #

(internal) conversion

nodeOpToFun1 :: forall a1 a2 loc1 loc2. (IsLocality loc1, SQLTypeable a2, IsLocality loc2) => NodeOp -> ComputeNode loc1 a1 -> ComputeNode loc2 a2 Source #

(internal) conversion

nodeOpToFun1Typed :: forall a1 a2 loc1 loc2. (HasCallStack, IsLocality loc1, IsLocality loc2) => SQLType a2 -> NodeOp -> ComputeNode loc1 a1 -> ComputeNode loc2 a2 Source #

(internal) conversion

nodeOpToFun1Untyped :: forall loc1 loc2. (HasCallStack, IsLocality loc1, IsLocality loc2) => DataType -> NodeOp -> ComputeNode loc1 Cell -> ComputeNode loc2 Cell Source #

(internal) conversion

nodeOpToFun2 :: forall a a1 a2 loc loc1 loc2. (SQLTypeable a, IsLocality loc, IsLocality loc1, IsLocality loc2) => NodeOp -> ComputeNode loc1 a1 -> ComputeNode loc2 a2 -> ComputeNode loc a Source #

(internal) conversion

nodeOpToFun2Typed :: forall a a1 a2 loc loc1 loc2. (IsLocality loc, IsLocality loc1, IsLocality loc2) => SQLType a -> NodeOp -> ComputeNode loc1 a1 -> ComputeNode loc2 a2 -> ComputeNode loc a Source #

(internal) conversion

Orphan instances