Safe Haskell | None |
---|---|
Language | Haskell2010 |
WARC (or Web ARCive) is a archival file format widely used to distribute corpora of crawled web content (see, for instance the Common Crawl corpus). A WARC file consists of a set of records, each of which describes a web request or response.
This module provides a streaming parser and encoder for WARC archives for use
with the pipes
package.
Here is a simple example which walks throught the WARC file:
{-# LANGUAGE RecordWildCards #-} {-# LANGUAGE OverloadedStrings #-} module Main where import Control.Lens import Control.Monad.IO.Class import qualified Data.ByteString as B import Data.Warc import qualified Pipes as P import Pipes.ByteString (fromHandle) import System.IO iterFunc :: Record IO b -> IO b iterFunc Record {..} = do case recHeader ^. recHeaders . at "Content-Type" of Just ct -> liftIO $ putStrLn ("Content-Type: " ++ show ct) Nothing -> return () r <- liftIO $ P.runEffect $ P.for recContent $ \x -> do liftIO $ putStrLn ("Got bytes: " ++ show (B.length x)) return () return r main :: IO () main = do withFile "example.warc" ReadMode $ \h -> do _ <- iterRecords iterFunc (parseWarc (fromHandle h)) return ()
Synopsis
- type Warc m a = FreeT (Record m) m (Producer ByteString m a)
- data Record m r = Record {
- recHeader :: RecordHeader
- recContent :: Producer ByteString m r
- parseWarc :: (Functor m, Monad m) => Producer ByteString m a -> Warc m a
- iterRecords :: forall m a. Monad m => (forall b. Record m b -> m b) -> Warc m a -> m (Producer ByteString m a)
- produceRecords :: forall m o a. Monad m => (forall b. RecordHeader -> Producer ByteString m b -> Producer o m b) -> Warc m a -> Producer o m (Producer ByteString m a)
- encodeRecord :: Monad m => Record m a -> Producer ByteString m a
- module Data.Warc.Header
Documentation
type Warc m a = FreeT (Record m) m (Producer ByteString m a) Source #
A WARC archive.
This represents a sequence of records followed by whatever data was leftover from the parse.
A WARC record
This represents a single record of a WARC file, consisting of a set of headers and a means of producing the record's body.
Record | |
|
Parsing
:: (Functor m, Monad m) | |
=> Producer ByteString m a | a producer of a stream of WARC content |
-> Warc m a | the parsed WARC archive |
Parse a WARC archive.
Note that this function does not actually do any parsing itself;
it merely returns a Warc
value which can then be run to parse
individual records.
:: Monad m | |
=> (forall b. Record m b -> m b) | the action to run on each |
-> Warc m a | the |
-> m (Producer ByteString m a) | returns any leftover data |
Iterate over the Record
s in a WARC archive
:: Monad m | |
=> (forall b. RecordHeader -> Producer ByteString m b -> Producer o m b) | consume the record producing some output |
-> Warc m a | a WARC archive (see |
-> Producer o m (Producer ByteString m a) | returns any leftover data |
Encoding
encodeRecord :: Monad m => Record m a -> Producer ByteString m a Source #
Encode a Record
in WARC format.
Headers
module Data.Warc.Header