Safe Haskell | None |
---|---|
Language | Haskell2010 |
Scalpel core provides a subset of the scalpel web scraping library that is intended to have lightweight dependencies and to be free of all non-Haskell dependencies.
Notably, this package does not contain any networking support. Users who want a batteries-included solution should depend on scalpel, which does include networking support, instead of scalpel-core.
More thorough documentation including example code can be found in the documentation of the scalpel package.
- data Selector
- data AttributePredicate
- data AttributeName
- data TagName
- tagSelector :: String -> Selector
- anySelector :: Selector
- (//) :: Selector -> Selector -> Selector
- (@:) :: TagName -> [AttributePredicate] -> Selector
- (@=) :: AttributeName -> String -> AttributePredicate
- (@=~) :: RegexLike re String => AttributeName -> re -> AttributePredicate
- hasClass :: String -> AttributePredicate
- notP :: AttributePredicate -> AttributePredicate
- match :: (String -> String -> Bool) -> AttributePredicate
- data Scraper str a
- attr :: (Ord str, Show str, StringLike str) => String -> Selector -> Scraper str str
- attrs :: (Ord str, Show str, StringLike str) => String -> Selector -> Scraper str [str]
- html :: (Ord str, StringLike str) => Selector -> Scraper str str
- htmls :: (Ord str, StringLike str) => Selector -> Scraper str [str]
- innerHTML :: (Ord str, StringLike str) => Selector -> Scraper str str
- innerHTMLs :: (Ord str, StringLike str) => Selector -> Scraper str [str]
- text :: (Ord str, StringLike str) => Selector -> Scraper str str
- texts :: (Ord str, StringLike str) => Selector -> Scraper str [str]
- chroot :: (Ord str, StringLike str) => Selector -> Scraper str a -> Scraper str a
- chroots :: (Ord str, StringLike str) => Selector -> Scraper str a -> Scraper str [a]
- position :: (Ord str, StringLike str) => Scraper str Int
- scrape :: (Ord str, StringLike str) => Scraper str a -> [Tag str] -> Maybe a
- scrapeStringLike :: (Ord str, StringLike str) => str -> Scraper str a -> Maybe a
Selectors
Selector defines a selection of an HTML DOM tree to be operated on by a web scraper. The selection includes the opening tag that matches the selection, all of the inner tags, and the corresponding closing tag.
data AttributePredicate
An AttributePredicate is a function that takes an Attribute and returns a Bool indicating whether the given attribute matches a predicate.
data AttributeName
The AttributeName type can be used when creating Selectors to specify the name of an attribute of a tag.
tagSelector :: String -> Selector
Wildcards
anySelector :: Selector
A selector which will match all tags.
Tag combinators
Attribute predicates
(@:) :: TagName -> [AttributePredicate] -> Selector infixl 9
The @: operator creates a Selector by combining a TagName with a list of AttributePredicates.
(@=) :: AttributeName -> String -> AttributePredicate infixl 6
The @= operator creates an AttributePredicate that will match attributes with the given name and value.
If you are attempting to match a specific class of a tag with potentially multiple classes, you should use the hasClass utility function.
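A minimal sketch of how @: and @= combine, assuming the module is imported as Text.HTML.Scalpel.Core and that OverloadedStrings is enabled so string literals can stand in for tag and attribute names. The markup and the names sampleHtml, nextLink, and nextHref are hypothetical.

```haskell
{-# LANGUAGE OverloadedStrings #-}
module Main where

import Text.HTML.Scalpel.Core

-- Hypothetical sample markup, not taken from the library documentation.
sampleHtml :: String
sampleHtml =
  "<a rel=\"prev\" href=\"/page/1\">Back</a>\
  \<a rel=\"next\" href=\"/page/3\">Forward</a>"

-- Select <a> tags whose rel attribute is exactly "next".
nextLink :: Selector
nextLink = "a" @: ["rel" @= "next"]

-- Read the href attribute of the first matching tag.
nextHref :: Maybe String
nextHref = scrapeStringLike sampleHtml (attr "href" nextLink)

main :: IO ()
main = print nextHref
```

Because @= compares the whole attribute value, it suits attributes such as rel or id that hold a single token.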
(@=~) :: RegexLike re String => AttributeName -> re -> AttributePredicate infixl 6
The @=~ operator creates an AttributePredicate that will match attributes with the given name and whose value matches the given regular expression.
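The following sketch shows one way to supply the regular expression. It assumes the regex-tdfa package as the RegexLike implementation (any other instance should work the same way), and the helper re, the markup, and the other names are hypothetical.

```haskell
{-# LANGUAGE OverloadedStrings #-}
module Main where

import Text.HTML.Scalpel.Core
import Text.Regex.TDFA (Regex, makeRegex)

-- Hypothetical helper: compile a String pattern into a regex-tdfa Regex.
re :: String -> Regex
re = makeRegex

-- Hypothetical sample markup.
sampleHtml :: String
sampleHtml =
  "<a href=\"https://example.com\">Secure</a>\
  \<a href=\"http://example.org\">Insecure</a>"

-- Keep only links whose href starts with "https://".
secureLinks :: Maybe [String]
secureLinks =
  scrapeStringLike sampleHtml (texts ("a" @: ["href" @=~ re "^https://"]))

main :: IO ()
main = print secureLinks
```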
hasClass :: String -> AttributePredicate
The classes of a tag are defined in HTML as a space-separated list given by the class attribute. The hasClass function will match a class attribute if the given class appears anywhere in the space-separated list of classes.
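As an illustration, the sketch below selects a tag that carries two classes by naming only one of them. It assumes OverloadedStrings and the Text.HTML.Scalpel.Core module name; the markup and identifiers are hypothetical.

```haskell
{-# LANGUAGE OverloadedStrings #-}
module Main where

import Text.HTML.Scalpel.Core

-- Hypothetical sample markup: the first <div> has two classes.
sampleHtml :: String
sampleHtml =
  "<div class=\"note warning\">Careful</div>\
  \<div class=\"note\">Plain</div>"

-- hasClass looks for "warning" anywhere in the class list.
warnings :: Maybe [String]
warnings =
  scrapeStringLike sampleHtml (texts ("div" @: [hasClass "warning"]))

main :: IO ()
main = print warnings
```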
notP :: AttributePredicate -> AttributePredicate
Negates an AttributePredicate.
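A small sketch of negation, assuming OverloadedStrings and hypothetical markup: it keeps list items that do not carry the class done, including the item that has no class attribute at all.

```haskell
{-# LANGUAGE OverloadedStrings #-}
module Main where

import Text.HTML.Scalpel.Core

-- Hypothetical sample markup.
sampleHtml :: String
sampleHtml =
  "<ul><li class=\"done\">A</li><li>B</li><li class=\"todo\">C</li></ul>"

-- Select every <li> for which hasClass "done" does not hold.
pending :: Maybe [String]
pending =
  scrapeStringLike sampleHtml (texts ("li" @: [notP (hasClass "done")]))

main :: IO ()
main = print pending
```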
match :: (String -> String -> Bool) -> AttributePredicate
The match function allows for the creation of arbitrary AttributePredicates. The argument is a function that takes the attribute key followed by the attribute value and returns a boolean indicating if the attribute satisfies the predicate.
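For instance, a custom predicate can test an attribute value by prefix. This is a sketch with hypothetical names (itemId, items) and markup, assuming OverloadedStrings.

```haskell
{-# LANGUAGE OverloadedStrings #-}
module Main where

import Data.List (isPrefixOf)
import Text.HTML.Scalpel.Core

-- Hypothetical sample markup.
sampleHtml :: String
sampleHtml =
  "<li id=\"item-1\">One</li><li id=\"other\">Two</li><li id=\"item-2\">Three</li>"

-- Hypothetical predicate: an id attribute whose value starts with "item-".
itemId :: AttributePredicate
itemId = match (\key value -> key == "id" && "item-" `isPrefixOf` value)

items :: Maybe [String]
items = scrapeStringLike sampleHtml (texts ("li" @: [itemId]))

main :: IO ()
main = print items
```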
Scrapers
Primitives
attrs :: (Ord str, Show str, StringLike str) => String -> Selector -> Scraper str [str]
The attrs function takes an attribute name and a selector and returns the value of the attribute of the given name for every opening tag that matches the given selector.
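A short sketch of collecting every link target on a page, assuming OverloadedStrings; the markup and the name hrefs are hypothetical.

```haskell
{-# LANGUAGE OverloadedStrings #-}
module Main where

import Text.HTML.Scalpel.Core

-- Hypothetical sample markup.
sampleHtml :: String
sampleHtml = "<a href=\"/one\">1</a><a href=\"/two\">2</a>"

-- Collect the href value of every <a> tag, in document order.
hrefs :: Maybe [String]
hrefs = scrapeStringLike sampleHtml (attrs "href" "a")

main :: IO ()
main = print hrefs
```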
htmls :: (Ord str, StringLike str) => Selector -> Scraper str [str]
The htmls function takes a selector and returns the HTML string from every set of tags matching the given selector.
innerHTML :: (Ord str, StringLike str) => Selector -> Scraper str str
The innerHTML function takes a selector and returns the inner HTML string from the set of tags described by the given selector. Inner HTML here means the HTML within, but not including, the selected tags.
This function will match only the first set of tags matching the selector; to match every set of tags, use innerHTMLs.
innerHTMLs :: (Ord str, StringLike str) => Selector -> Scraper str [str]
The innerHTMLs function takes a selector and returns the inner HTML string from every set of tags matching the given selector.
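To contrast the two families, the sketch below runs htmls and innerHTMLs over the same hypothetical markup, assuming OverloadedStrings. With this input, htmls is expected to keep the surrounding li tags while innerHTMLs drops them; the exact rendering of more complex markup (entities, attribute quoting) is not covered here.

```haskell
{-# LANGUAGE OverloadedStrings #-}
module Main where

import Text.HTML.Scalpel.Core

-- Hypothetical sample markup.
sampleHtml :: String
sampleHtml = "<ul><li><b>A</b></li><li>B</li></ul>"

-- Each match rendered with its own opening and closing tags.
outer :: Maybe [String]
outer = scrapeStringLike sampleHtml (htmls "li")

-- Each match rendered without the selected tags themselves.
inner :: Maybe [String]
inner = scrapeStringLike sampleHtml (innerHTMLs "li")

main :: IO ()
main = do
  print outer
  print inner
```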
texts :: (Ord str, StringLike str) => Selector -> Scraper str [str]
The texts function takes a selector and returns the inner text from every set of tags matching the given selector.
chroot :: (Ord str, StringLike str) => Selector -> Scraper str a -> Scraper str a
The chroot function takes a selector and an inner scraper and executes the inner scraper as if it were scraping a document that consists solely of the tags corresponding to the selector.
This function will match only the first set of tags matching the selector; to match every set of tags, use chroots.
chroots :: (Ord str, StringLike str) => Selector -> Scraper str a -> Scraper str [a]
The chroots function takes a selector and an inner scraper and executes the inner scraper as if it were scraping a document that consists solely of the tags corresponding to the selector. The inner scraper is executed for each set of tags matching the given selector.
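One common use is turning repeated blocks of markup into records. The sketch below assumes OverloadedStrings and relies on Scraper being usable in applicative style; the Comment type, the markup, and the class names are hypothetical.

```haskell
{-# LANGUAGE OverloadedStrings #-}
module Main where

import Text.HTML.Scalpel.Core

-- Hypothetical record for one scraped comment.
data Comment = Comment
  { author :: String
  , body   :: String
  } deriving (Show, Eq)

-- Hypothetical sample markup with two comment blocks.
sampleHtml :: String
sampleHtml =
  "<div class=\"comment\"><span class=\"author\">Ann</span><p>Hi</p></div>\
  \<div class=\"comment\"><span class=\"author\">Bob</span><p>Yo</p></div>"

-- The inner scraper only sees the tags of one comment block at a time.
comment :: Scraper String Comment
comment = Comment
  <$> text ("span" @: [hasClass "author"])
  <*> text "p"

-- Run the inner scraper once per matching <div>.
comments :: Maybe [Comment]
comments =
  scrapeStringLike sampleHtml (chroots ("div" @: [hasClass "comment"]) comment)

main :: IO ()
main = print comments
```

Scoping the inner scraper this way keeps each author paired with the paragraph from the same block, which two independent calls to texts would not guarantee if a block were missing one of the two elements.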
position :: (Ord str, StringLike str) => Scraper str Int
The position function is intended to be used within the do-block of a chroots call. Within the do-block, position will return the index of the current sub-tree within the list of all sub-trees matched by the selector passed to chroots.
For example, consider the following HTML:

    <article>
      <p> First paragraph. </p>
      <p> Second paragraph. </p>
      <p> Third paragraph. </p>
    </article>

The position function can be used to determine the index of each <p> tag within the article tag by doing the following:

    chroots ("article" // "p") $ do
      index   <- position
      content <- text "p"
      return (index, content)

Which will evaluate to the list:

    [ (0, "First paragraph.")
    , (1, "Second paragraph.")
    , (2, "Third paragraph.")
    ]

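The fragment above can be wrapped into a complete program roughly as follows. This is a sketch assuming OverloadedStrings; the names sampleHtml and indexed are hypothetical. The sample markup omits the padding spaces inside each <p>, since the sketch does not assume that text trims whitespace, and it uses anySelector inside the do-block to select the root tag of each sub-tree.

```haskell
{-# LANGUAGE OverloadedStrings #-}
module Main where

import Text.HTML.Scalpel.Core

-- Hypothetical sample markup, written without padding inside the <p> tags.
sampleHtml :: String
sampleHtml =
  "<article><p>First paragraph.</p><p>Second paragraph.</p>\
  \<p>Third paragraph.</p></article>"

-- Pair each <p> inside <article> with its index among the matches.
indexed :: Maybe [(Int, String)]
indexed = scrapeStringLike sampleHtml $
  chroots ("article" // "p") $ do
    index   <- position
    content <- text anySelector  -- the root <p> of the current sub-tree
    return (index, content)

main :: IO ()
main = print indexed
```

Note the parentheses around the selector: function application binds tighter than //, so the selector has to be grouped before it is passed to chroots.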
Executing scrapers
scrapeStringLike :: (Ord str, StringLike str) => str -> Scraper str a -> Maybe a
The scrapeStringLike function parses a StringLike value into a list of tags and executes a Scraper on it.
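A minimal sketch of both entry points, assuming OverloadedStrings and a direct dependency on the tagsoup package for parseTags. The function names pageTitle and pageTitleViaTags are hypothetical, and the parser options that scrapeStringLike uses internally are not assumed to be identical to plain parseTags.

```haskell
{-# LANGUAGE OverloadedStrings #-}
module Main where

import Text.HTML.Scalpel.Core
import Text.HTML.TagSoup (parseTags)

-- Parse and scrape in one step; Nothing means the scraper found no <title>.
pageTitle :: String -> Maybe String
pageTitle page = scrapeStringLike page (text "title")

-- The same scraper run over tags that were parsed separately.
pageTitleViaTags :: String -> Maybe String
pageTitleViaTags page = scrape (text "title") (parseTags page)

main :: IO ()
main = do
  print (pageTitle "<html><head><title>Hello</title></head></html>")
  print (pageTitle "<p>no title here</p>")
  print (pageTitleViaTags "<html><head><title>Hello</title></head></html>")
```

Parsing separately with scrape is useful when the tag list is needed for something else as well, or when several scrapers run over the same document.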