pdftotext: Extracts text from PDF using poppler

The pdftotext package provides functions for extracting plain text from PDF documents. It uses the C++ library Poppler, which must be installed on the system. The output of the Haskell pdftotext library is identical to the output of Poppler's pdftotext tool.


Properties

Versions: 0.0.1.0, 0.0.2.0, 0.1.0.0, 0.1.0.1
Change log: CHANGELOG.md
Dependencies: base (>=4.11 && <5), bytestring (>=0.10 && <0.11), text (>=1.2 && <1.3), xml-conduit (>=1.8 && <1.9)
License: BSD-3-Clause
Copyright: 2020 G. Eyaeb
Author: G. Eyaeb
Maintainer: geyaeb@protonmail.com
Category: Text, PDF
Home page: https://sr.ht/~geyaeb/haskell-pdftotext/
Bug tracker: https://todo.sr.ht/~geyaeb/haskell-pdftotext
Source repo: head: hg clone https://hg.sr.ht/~geyaeb/haskell-pdftotext
Uploaded: by geyaeb at 2020-06-11T13:19:06Z

Flags

Automatic Flags
Name: xml-conduit
Description: Parse metadata of PDF document properties using xml-conduit
Default: Disabled

Use -f <flag> to enable a flag, or -f -<flag> to disable that flag.

Readme for pdftotext-0.0.2.0

pdftotext

Usage

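import qualified Data.Text.IO as T
import Pdftotext

main :: IO ()
main = do
  -- openFile yields a Maybe, so handle the failure case explicitly
  -- instead of relying on a partial pattern match.
  result <- openFile "path/to/file.pdf"
  case result of
    Just pdf -> T.putStrLn $ pdftotext Physical pdf
    Nothing  -> putStrLn "Unable to open the PDF document"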

Flags

xml-conduit

pdftotext can extract properties from a PDF document. One of them is the metadata, which takes the form of an XML document. If the xml-conduit flag is set, the metadata is parsed using the xml-conduit package; otherwise it is provided as text.

Internals

The library calls Poppler through the FFI, so internally every function runs in IO. Their non-IO variants, which wrap the IO functions with unsafePerformIO, should nevertheless be safe to use. The module Pdftotext.Internal exposes all the IO-typed functions.
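
The general pattern can be illustrated with a minimal, self-contained sketch. This is not the library's own code: it uses strlen from the C standard library as a stand-in for a Poppler call, and the names c_strlen, ioLength and pureLength are hypothetical, chosen only for this example. The pure wrapper is reasonable here because the result depends solely on the argument and the call has no observable side effects.

```haskell
{-# LANGUAGE ForeignFunctionInterface #-}

import Foreign.C.String (CString, withCString)
import Foreign.C.Types (CSize (..))
import System.IO.Unsafe (unsafePerformIO)

-- Stand-in for a foreign library function: strlen from the C standard library.
foreign import ccall unsafe "string.h strlen"
  c_strlen :: CString -> IO CSize

-- IO-typed variant: marshals the Haskell String to a C string and calls strlen.
ioLength :: String -> IO Int
ioLength s = withCString s $ \cs -> fromIntegral <$> c_strlen cs

-- Pure variant built on top of the IO-typed one with unsafePerformIO.
pureLength :: String -> Int
pureLength = unsafePerformIO . ioLength
{-# NOINLINE pureLength #-}

main :: IO ()
main = do
  n <- ioLength "poppler"        -- IO variant
  print n                        -- prints 7
  print (pureLength "poppler")   -- pure variant, also prints 7
```

Exposing both variants, as the sketch does, lets callers who need explicit control over evaluation order stay in IO while everyone else uses the pure interface.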

Contribute

The project is hosted at https://sr.ht/~geyaeb/haskell-pdftotext/ . The homepage provides links to the Mercurial repository, mailing list, and ticket tracker.

Patches, suggestions, questions, and general discussion can be sent to the mailing list. Detailed information about sending patches by email can be found at https://man.sr.ht/hg.sr.ht/email.md.