hadoop-streaming: A simple Hadoop streaming library

A simple Hadoop streaming library based on conduit, useful for writing mapper and reducer logic in Haskell and running it on AWS Elastic MapReduce, Azure HDInsight, GCP Dataproc, and so forth.


Properties

Versions 0.1.0.0, 0.2.0.0, 0.2.0.1, 0.2.0.2, 0.2.0.3
Change log CHANGELOG.md
Dependencies base (>=4.12 && <5), bytestring (==0.10.*), conduit (>=1.3.1 && <1.4), extra (>=1.6.18 && <1.8), text (>=1.2.4.0 && <1.3) [details]
License BSD-3-Clause
Copyright 2020 Ziyang Liu
Author Ziyang Liu <free@cofree.io>
Maintainer Ziyang Liu <free@cofree.io>
Category Cloud, Distributed Computing, MapReduce
Home page https://github.com/zliu41/hadoop-streaming
Bug tracker https://github.com/zliu41/hadoop-streaming/issues
Source repo head: git clone https://github.com/zliu41/hadoop-streaming
Uploaded by zliu41 at 2020-04-06T16:17:22Z

Readme for hadoop-streaming-0.2.0.2

Hackage: https://hackage.haskell.org/package/hadoop-streaming

Word Count Example

See the Haddock in HadoopStreaming.Text for a simple word-count example.

A Few Things to Note

ByteString vs Text

The HadoopStreaming module provides the general Mapper and Reducer data types, whose input and output types are abstract. They are usually instantiated with either ByteString or Text. ByteString is the better fit when the input or output needs to be decoded or encoded, for instance with the base64-bytestring library. Text can make more sense when no decoding or encoding is needed, or when the data is not UTF-8 encoded (see the Encoding section below). In general, I'd expect ByteString to be used much more often than Text.

The HadoopStreaming.ByteString and HadoopStreaming.Text modules provide some utilities for working with ByteString and Text, respectively.
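To make the ByteString/Text distinction concrete, here is a minimal sketch that uses only the bytestring and text packages. It does not use the hadoop-streaming API: splitKeyValue and decodeLine are hypothetical helpers introduced for illustration, and the tab-separated key/value layout is an assumption about the input. It shows raw bytes being split as-is, and the explicit (and fallible) step needed to view those bytes as UTF-8 Text.

```haskell
module Main (main) where

import           Data.ByteString       (ByteString)
import qualified Data.ByteString.Char8 as BC
import           Data.Text             (Text)
import qualified Data.Text             as T
import qualified Data.Text.Encoding    as TE

-- Hypothetical helper (not part of hadoop-streaming): split a raw
-- input line into key and value at the first tab character.
-- If there is no tab, the value is empty.
splitKeyValue :: ByteString -> (ByteString, ByteString)
splitKeyValue line =
  let (key, rest) = BC.break (== '\t') line
  in  (key, BC.drop 1 rest)

-- Hypothetical helper: interpret raw bytes as UTF-8 text.
-- Invalid UTF-8 is reported as a Left instead of throwing.
decodeLine :: ByteString -> Either String Text
decodeLine = either (Left . show) Right . TE.decodeUtf8'

main :: IO ()
main = do
  -- Working on raw bytes: no decoding step is involved.
  print (splitKeyValue (BC.pack "hello\t1"))
  -- Working on Text: the bytes must first decode as valid UTF-8.
  print (fmap T.toUpper (decodeLine (BC.pack "hello")))
  -- The bytes 0xFF 0xFE are not valid UTF-8, so decoding yields a Left.
  print (decodeLine (BC.pack "\xff\xfe"))
```

Staying in ByteString defers any decoding decision to the mapper or reducer logic, whereas committing to Text means every input line has to pass a decoding step like decodeLine before it can be processed.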

Encoding

It is highly recommended that your input data be UTF-8 encoded, as this is the default encoding Hadoop uses. If you must use other encodings such as UTF-16, keep in mind the following gotchas: