| Safe Haskell | None |
|---|---|
| Language | Haskell2010 |
Text.HTML.Scalpel
Description
Scalpel is a web scraping library inspired by libraries like parsec and Perl's Web::Scraper. Scalpel builds on top of Text.HTML.TagSoup to provide a declarative and monadic interface.
There are two general mechanisms provided by this library that are used to build web scrapers: Selectors and Scrapers.
Selectors describe a location within an HTML DOM tree. The simplest selector
that can be written is a plain string value. For example, the selector
"div" matches every div node in a DOM. Selectors can be combined
using tag combinators. The // operator defines nested relationships
within a DOM tree. For example, the selector "div" // "a" matches all
anchor tags nested arbitrarily deep within a div tag.
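As a minimal sketch of the nesting operator, assuming the version of scalpel documented on this page, the following scrapes a made-up HTML string and collects the text of the anchors that sit inside a div. The names sampleHtml and nestedLinks are invented for this illustration.

```haskell
import Text.HTML.Scalpel

-- Hypothetical markup, used only for this illustration.
sampleHtml :: String
sampleHtml = "<div><p><a>inner</a></p></div><a>outer</a>"

-- Text of every anchor nested at any depth inside a div.
nestedLinks :: Maybe [String]
nestedLinks = scrapeStringLike sampleHtml (texts ("div" // "a"))

main :: IO ()
main = print nestedLinks
```

The anchor that sits outside of any div should not be part of the result.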
In addition to describing the nested relationships between tags, selectors
can also include predicates on the attributes of a tag. The @: operator
creates a selector that matches a tag based on the name and various
conditions on the tag's attributes. An attribute predicate is just a function
that takes an attribute and returns a boolean indicating whether the attribute
matches a criterion. There are several attribute operators that can be used
to generate common predicates. The @= operator creates a predicate that
matches the name and value of an attribute exactly. For example, the selector
"div" @: ["id" @= "article"] matches div tags where the id
attribute is equal to "article".
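A small sketch of an attribute predicate in use, assuming the API documented on this page. The markup and the names sampleHtml and articleText are hypothetical.

```haskell
import Text.HTML.Scalpel

-- Hypothetical markup with two divs that differ only in their id.
sampleHtml :: String
sampleHtml = "<div id='sidebar'>links</div><div id='article'>body</div>"

-- Inner text of the first div whose id attribute is exactly "article".
articleText :: Maybe String
articleText = scrapeStringLike sampleHtml (text ("div" @: ["id" @= "article"]))

main :: IO ()
main = print articleText
```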
Scrapers are values that are parameterized over a selector and produce
a value from an HTML DOM tree. The Scraper type takes two type parameters.
The first is the string-like type that is used to store the text values
within a DOM tree. Any string-like type supported by Text.StringLike is
valid. The second type is the type of value that the scraper produces.
There are several scraper primitives that take selectors and extract content from the DOM. Each primitive defined by this library comes in two variants: singular and plural. The singular variants extract the first instance matching the given selector, while the plural variants match every instance.
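The difference between the singular and plural variants can be sketched with text and texts. This is an illustration under the assumption that the API behaves as described on this page; sampleHtml, firstItem and allItems are made-up names.

```haskell
import Text.HTML.Scalpel

-- Hypothetical list markup.
sampleHtml :: String
sampleHtml = "<ul><li>one</li><li>two</li></ul>"

-- Singular variant: only the first matching li.
firstItem :: Maybe String
firstItem = scrapeStringLike sampleHtml (text "li")

-- Plural variant: every matching li, in document order.
allItems :: Maybe [String]
allItems = scrapeStringLike sampleHtml (texts "li")

main :: IO ()
main = do
  print firstItem
  print allItems
```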
The following is an example that demonstrates most of the features provided
by this library. Suppose you have the following hypothetical HTML located at
"http://example.com/article.html" and you would like to extract a list of
all of the comments.
<html>
  <body>
    <div class='comments'>
      <div class='comment container'>
        <span class='comment author'>Sally</span>
        <div class='comment text'>Woo hoo!</div>
      </div>
      <div class='comment container'>
        <span class='comment author'>Bill</span>
        <img class='comment image' src='http://example.com/cat.gif' />
      </div>
      <div class='comment container'>
        <span class='comment author'>Susan</span>
        <div class='comment text'>WTF!?!</div>
      </div>
    </div>
  </body>
</html>
The following snippet defines a function, allComments, that will download
the web page, and extract all of the comments into a list:
import Control.Applicative ((<|>))
import Text.HTML.Scalpel

type Author = String

data Comment
    = TextComment Author String
    | ImageComment Author URL

allComments :: IO (Maybe [Comment])
allComments = scrapeURL "http://example.com/article.html" comments
  where
    -- Run the comment scraper once per comment container.
    comments :: Scraper String [Comment]
    comments = chroots ("div" @: [hasClass "container"]) comment

    -- A comment is either a text comment or an image comment.
    comment :: Scraper String Comment
    comment = textComment <|> imageComment

    textComment :: Scraper String Comment
    textComment = do
        author      <- text $ "span" @: [hasClass "author"]
        commentText <- text $ "div"  @: [hasClass "text"]
        return $ TextComment author commentText

    imageComment :: Scraper String Comment
    imageComment = do
        author   <- text       $ "span" @: [hasClass "author"]
        imageURL <- attr "src" $ "img"  @: [hasClass "image"]
        return $ ImageComment author imageURL
- data Selector str
- class Selectable str s | s -> str where
- toSelector :: s -> Selector str
- type AttributePredicate str = Attribute str -> Bool
- class AttributeName str k | k -> str
- class TagName str t | t -> str
- data Any str = Any
- (//) :: (StringLike str, Selectable str a, Selectable str b) => a -> b -> Selector str
- (@:) :: (StringLike str, TagName str tag) => tag -> [AttributePredicate str] -> Selector str
- (@=) :: (StringLike str, AttributeName str key) => key -> str -> AttributePredicate str
- (@=~) :: (StringLike str, AttributeName str key, RegexLike re str) => key -> re -> AttributePredicate str
- hasClass :: StringLike str => str -> AttributePredicate str
- select :: (StringLike str, Selectable str s) => s -> [Tag str] -> [[Tag str]]
- data Scraper str a
- attr :: (Show str, StringLike str, Selectable str s) => str -> s -> Scraper str str
- attrs :: (Show str, StringLike str, Selectable str s) => str -> s -> Scraper str [str]
- html :: (StringLike str, Selectable str s) => s -> Scraper str str
- htmls :: (StringLike str, Selectable str s) => s -> Scraper str [str]
- text :: (StringLike str, Selectable str s) => s -> Scraper str str
- texts :: (StringLike str, Selectable str s) => s -> Scraper str [str]
- chroot :: (StringLike str, Selectable str s) => s -> Scraper str a -> Scraper str a
- chroots :: (StringLike str, Selectable str s) => s -> Scraper str a -> Scraper str [a]
- scrape :: Scraper str a -> [Tag str] -> Maybe a
- scrapeStringLike :: StringLike str => str -> Scraper str a -> Maybe a
- type URL = String
- scrapeURL :: StringLike str => URL -> Scraper str a -> IO (Maybe a)
- scrapeURLWithOpts :: StringLike str => [CurlOption] -> URL -> Scraper str a -> IO (Maybe a)
Selectors
Selector defines a selection of an HTML DOM tree to be operated on by
a web scraper. The selection includes the opening tag that matches the
selection, all of the inner tags, and the corresponding closing tag.
Instances
| Selectable str (Selector str) |
class Selectable str s | s -> str where
The Selectable class defines a class of types that are capable of being
cast into a Selector, which in turn describes a section of an HTML DOM
tree.
Methods
toSelector :: s -> Selector str
Instances
| Selectable String String | |
| Selectable ByteString ByteString | |
| Selectable ByteString ByteString | |
| Selectable Text Text | |
| Selectable Text Text | |
| Selectable str (Any str) | |
| Selectable str (Selector str) |
type AttributePredicate str = Attribute str -> Bool
An AttributePredicate is a function that takes an Attribute and
returns a Bool indicating whether the given attribute matches a predicate.
class AttributeName str k | k -> str
The AttributeName class defines a class of types that can be used when
creating Selectors to specify the name of an attribute of a tag.
The most basic types of AttributeName are the string-like types (e.g.
String, Text, etc). Values of these types refer to attributes with
names of that value.
In addition there is also the Any type which will match any attribute name.
Minimal complete definition
matchKey
class TagName str t | t -> str
The TagName class defines a class of types that can be used when creating
Selectors to specify the name of a tag.
The most basic types of TagName are the string-like types (e.g. String,
Text, etc). Values of these types refer to tags of the given value.
In addition there is also the Any type which will match any tag.
Minimal complete definition
toSelectNode
Wildcards
Any can be used as a wildcard when constructing selectors to match tags
and attributes with any name.
For example, the selector Any @: [Any @= "foo"] matches all tags that
have any attribute where the value is "foo".
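The wildcard selector above can be tried on a small document. This is a sketch under the assumption that Any behaves as described here; the markup and the names sampleHtml and fooTexts are hypothetical.

```haskell
import Text.HTML.Scalpel

-- Hypothetical markup: two tags carry some attribute with the value "foo".
sampleHtml :: String
sampleHtml =
  "<p title='foo'>first</p><span name='foo'>second</span><p title='bar'>third</p>"

-- Text of every tag, whatever its name, that has any attribute equal to "foo".
fooTexts :: Maybe [String]
fooTexts = scrapeStringLike sampleHtml (texts (Any @: [Any @= "foo"]))

main :: IO ()
main = print fooTexts
```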
Constructors
| Any |
Instances
| TagName str (Any str) | |
| AttributeName str (Any str) | |
| Selectable str (Any str) |
Tag combinators
(//) :: (StringLike str, Selectable str a, Selectable str b) => a -> b -> Selector str infixl 5
Attribute predicates
(@:) :: (StringLike str, TagName str tag) => tag -> [AttributePredicate str] -> Selector str infixl 9
The @: operator creates a Selector by combining a TagName with a list
of AttributePredicates.
(@=) :: (StringLike str, AttributeName str key) => key -> str -> AttributePredicate str infixl 6
The @= operator creates an AttributePredicate that will match
attributes with the given name and value.
If you are attempting to match a specific class of a tag with potentially
multiple classes, you should use the hasClass utility function.
(@=~) :: (StringLike str, AttributeName str key, RegexLike re str) => key -> re -> AttributePredicate str infixl 6
The @=~ operator creates an AttributePredicate that will match
attributes with the given name and whose value matches the given regular
expression.
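A possible use of the regular expression predicate is sketched below. It assumes the regex-tdfa package as the regex backend, which this page does not prescribe; any type with a RegexLike instance should work in its place. The helper re, the markup, and the name gifLinks are invented for this illustration.

```haskell
import Text.HTML.Scalpel
import qualified Text.Regex.TDFA as TDFA

-- Hypothetical helper: compile a String pattern with regex-tdfa (an assumption).
re :: String -> TDFA.Regex
re = TDFA.makeRegex

-- Hypothetical markup with two links.
sampleHtml :: String
sampleHtml = "<a href='/cat.gif'>cat</a><a href='/dog.png'>dog</a>"

-- Text of every anchor whose href value ends in ".gif".
gifLinks :: Maybe [String]
gifLinks = scrapeStringLike sampleHtml (texts ("a" @: ["href" @=~ re "\\.gif$"]))

main :: IO ()
main = print gifLinks
```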
hasClass :: StringLike str => str -> AttributePredicate str
The classes of a tag are defined in HTML as a space-separated list given by
the class attribute. The hasClass function will match a class attribute
if the given class appears anywhere in the space-separated list of classes.
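The contrast between hasClass and an exact match with @= can be sketched as follows, assuming the behavior described on this page. The markup and the names sampleHtml, exactMatch and classMatch are hypothetical.

```haskell
import Text.HTML.Scalpel

-- Hypothetical markup: one div that carries two classes.
sampleHtml :: String
sampleHtml = "<div class='comment text'>Woo hoo!</div>"

-- Exact comparison against the whole attribute value "comment text".
exactMatch :: Maybe String
exactMatch = scrapeStringLike sampleHtml (text ("div" @: ["class" @= "text"]))

-- Membership test within the space-separated class list.
classMatch :: Maybe String
classMatch = scrapeStringLike sampleHtml (text ("div" @: [hasClass "text"]))

main :: IO ()
main = do
  print exactMatch
  print classMatch
```

Here the exact match is expected to fail because the attribute value is "comment text" rather than "text", while hasClass finds the class in the list.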
Executing selectors
select :: (StringLike str, Selectable str s) => s -> [Tag str] -> [[Tag str]]
The select function takes a Selectable value and a list of
Tags and returns a list of every subsequence of the given list of
Tags that matches the given selector.
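A rough sketch of calling select directly, assuming the tags come from TagSoup's parseTags and that select behaves as described here. The names sampleTags and itemTags are invented for this illustration.

```haskell
import Text.HTML.Scalpel
import Text.HTML.TagSoup (Tag, innerText, parseTags)

-- Hypothetical input, parsed into a flat list of tags by TagSoup.
sampleTags :: [Tag String]
sampleTags = parseTags "<ul><li>one</li><li>two</li></ul>"

-- One sub-list of tags per matching li element.
itemTags :: [[Tag String]]
itemTags = select "li" sampleTags

main :: IO ()
main = do
  print (length itemTags)
  -- innerText concatenates the text nodes of each matched sub-list.
  print (map innerText itemTags)
```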
Scrapers
A value of type Scraper str a defines a web scraper that is capable of consuming
a list of Tags and optionally producing a value of type a.
Instances
| Alternative (Scraper str) | |
| Monad (Scraper str) | |
| Functor (Scraper str) | |
| Applicative (Scraper str) |
Primitives
attr :: (Show str, StringLike str, Selectable str s) => str -> s -> Scraper str str
attrs :: (Show str, StringLike str, Selectable str s) => str -> s -> Scraper str [str]
The attrs function takes an attribute name and a selector and returns the
value of the attribute of the given name for every opening tag that matches
the given selector.
html :: (StringLike str, Selectable str s) => s -> Scraper str str
htmls :: (StringLike str, Selectable str s) => s -> Scraper str [str]
The htmls function takes a selector and returns the html string from every
set of tags matching the given selector.
text :: (StringLike str, Selectable str s) => s -> Scraper str str
texts :: (StringLike str, Selectable str s) => s -> Scraper str [str]
The texts function takes a selector and returns the inner text from every
set of tags matching the given selector.
chroot :: (StringLike str, Selectable str s) => s -> Scraper str a -> Scraper str a
The chroot function takes a selector and an inner scraper and executes
the inner scraper as if it were scraping a document that consists solely of
the tags corresponding to the selector.
This function will match only the first set of tags matching the selector; to
match every set of tags, use chroots.
chroots :: (StringLike str, Selectable str s) => s -> Scraper str a -> Scraper str [a]
The chroots function takes a selector and an inner scraper and executes
the inner scraper as if it were scraping a document that consists solely of
the tags corresponding to the selector. The inner scraper is executed for
each set of tags matching the given selector.
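The following sketch contrasts chroot and chroots on the same inner scraper, assuming the API documented on this page. The markup and the names sampleHtml, pair, firstPair and allPairs are hypothetical.

```haskell
import Text.HTML.Scalpel

-- Hypothetical markup: two items, each with a name and a value.
sampleHtml :: String
sampleHtml =
  "<div class='item'><b>A</b><i>1</i></div><div class='item'><b>B</b><i>2</i></div>"

-- Inner scraper; it only sees the tags of the item it is run against.
pair :: Scraper String (String, String)
pair = do
  name  <- text "b"
  value <- text "i"
  return (name, value)

-- chroot runs the inner scraper on the first matching item only.
firstPair :: Maybe (String, String)
firstPair = scrapeStringLike sampleHtml (chroot ("div" @: [hasClass "item"]) pair)

-- chroots runs the inner scraper once per matching item.
allPairs :: Maybe [(String, String)]
allPairs = scrapeStringLike sampleHtml (chroots ("div" @: [hasClass "item"]) pair)

main :: IO ()
main = do
  print firstPair
  print allPairs
```

Because the inner scraper is confined to one item at a time, text "b" and text "i" cannot pair a name from one item with a value from another.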
Executing scrapers
scrapeStringLike :: StringLike str => str -> Scraper str a -> Maybe a
The scrapeStringLike function parses a StringLike value into a list of
tags and executes a Scraper on it.
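As an illustration, scrapeStringLike can be compared with parsing the input yourself and handing the tags to scrape. This sketch assumes TagSoup's parseTags is an acceptable stand-in for whatever parser options scrapeStringLike uses internally, which this page does not specify. The names page, viaString and viaTags are made up.

```haskell
import Text.HTML.Scalpel
import Text.HTML.TagSoup (parseTags)

-- Hypothetical document.
page :: String
page = "<html><head><title>Hi</title></head></html>"

-- Parse and scrape in one step.
viaString :: Maybe String
viaString = scrapeStringLike page (text "title")

-- Parse with TagSoup first, then run the same scraper on the tags.
viaTags :: Maybe String
viaTags = scrape (text "title") (parseTags page)

main :: IO ()
main = do
  print viaString
  print viaTags
```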
scrapeURLWithOpts :: StringLike str => [CurlOption] -> URL -> Scraper str a -> IO (Maybe a)
The scrapeURLWithOpts function takes a list of curl options, downloads
the contents of the given URL, and executes a Scraper on it.