Safe Haskell | None |
---|---|
Language | Haskell2010 |
This module defines a fetcher strategy that generates entries by scraping the HTML document served at a given URL.
A Selector must be provided so the fetcher knows where to take the information for each entry field from.
Be aware that scraping an HTML page offers very few consistency
guarantees. Depending on the page structure and the selectors you
give, you could end up with 5 URIs, 4 titles and 6 descriptions. Keep
in mind that the URIs are the leading and limiting field, so in the
previous scenario one Nothing title would be added and one
description would be discarded.
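The alignment rule described above can be sketched with plain lists. This is only an illustration of the behaviour, not the library's actual implementation: `ScrapedEntry`, its fields and `alignToURIs` are hypothetical names, and `String` stands in for whatever text type the real `Entry` uses.

```haskell
-- Hypothetical, simplified record; the real Entry type has more fields.
data ScrapedEntry = ScrapedEntry
  { seURI         :: String
  , seTitle       :: Maybe String
  , seDescription :: Maybe String
  } deriving (Eq, Show)

-- | Pair every URI with the title and description found at the same position.
-- Each optional field list is extended with an endless tail of Nothing, so
-- zipWith3 always stops at the end of the URI list: missing values become
-- Nothing and surplus values are dropped.
alignToURIs :: [String] -> [String] -> [String] -> [ScrapedEntry]
alignToURIs uris titles descriptions =
  zipWith3 ScrapedEntry uris (pad titles) (pad descriptions)
  where
    pad xs = map Just xs ++ repeat Nothing
```

Loaded in GHCi with 5 URIs, 4 titles and 6 descriptions, `alignToURIs` returns 5 entries: the last one has a `Nothing` title, and the sixth description does not appear anywhere.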
Here is an example:

    {-# LANGUAGE OverloadedStrings #-}

    import Follow
    import Follow.Fetchers.WebScraping

    -- Where to find each entry field in the page.
    selector :: Selector
    selector = Selector
      { selURI         = Just $ Attr ".title a" "href"
      , selGUID        = Just $ Attr ".title a" "href"
      , selTitle       = Just $ InnerText ".title a"
      , selDescription = Just $ InnerText ".description"
      , selAuthor      = Just $ InnerText ".author"
      , selPublishDate = Nothing
      }

    -- fetch takes the URL and the selector as two separate arguments.
    result :: IO [Entry]
    result = fetch "http://an_url.com" selector
Synopsis
- fetch :: (MonadThrow m, MonadIO m, MonadHttp m) => ByteString -> Selector -> Fetched m
- data Selector = Selector {}
- data SelectorItem
- type CSSSelector = Text
- type HTMLAttribute = Text
Documentation
fetch :: (MonadThrow m, MonadIO m, MonadHttp m) => ByteString -> Selector -> Fetched m
Fetches entries from the given URL using the specified selectors.
data Selector
Data type with the selectors to use when scraping each Entry item.
data SelectorItem
Selector to use when scraping an Entry item.
Constructors
InnerText CSSSelector | Takes the text that is an immediate descendant of the tag matched by the given CSS selector. |
Attr CSSSelector HTMLAttribute | Takes the value of the given attribute in the tag matched by the given CSS selector. |
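To make the difference between the two constructors concrete, here is a minimal sketch that applies a `SelectorItem` to an already-matched set of tags. It makes several assumptions: `Element`, `extract` and the `match` argument are hypothetical and not part of this module, CSS matching itself is left to the caller, and `String` replaces `Text` to keep the sketch dependency-free.

```haskell
import Data.Maybe (mapMaybe)

-- The real module uses Text for both aliases; String is a simplification.
type CSSSelector   = String
type HTMLAttribute = String

data SelectorItem
  = InnerText CSSSelector
  | Attr CSSSelector HTMLAttribute
  deriving (Eq, Show)

-- Hypothetical stand-in for a parsed HTML tag.
data Element = Element
  { elAttrs :: [(HTMLAttribute, String)]
  , elText  :: String
  } deriving (Eq, Show)

-- | Collect the values a selector item yields. The first argument is a
-- placeholder for real CSS selection: it returns the tags matched by a
-- selector. InnerText reads each tag's text; Attr reads the named
-- attribute and skips tags that do not carry it.
extract :: (CSSSelector -> [Element]) -> SelectorItem -> [String]
extract match (InnerText css) = map elText (match css)
extract match (Attr css name) = mapMaybe (lookup name . elAttrs) (match css)
```

In this sketch a matched tag without the requested attribute produces no value at all, which is one way the per-field counts can drift apart. Whether the library behaves the same way in that case is not stated here.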
Instances
Eq SelectorItem
  Defined in Follow.Fetchers.WebScraping.Internal
    (==) :: SelectorItem -> SelectorItem -> Bool
    (/=) :: SelectorItem -> SelectorItem -> Bool
Show SelectorItem
  Defined in Follow.Fetchers.WebScraping.Internal
    showsPrec :: Int -> SelectorItem -> ShowS
    show :: SelectorItem -> String
    showList :: [SelectorItem] -> ShowS
FromJSON SelectorItem
    type: text
    options:
      css: .selector
  or
    type: attr
    options:
      css: .link
      name: href
  Defined in Follow.Parser
    parseJSON :: Value -> Parser SelectorItem
    parseJSONList :: Value -> Parser [SelectorItem]
type CSSSelector = Text
A CSS2 selector.
type HTMLAttribute = Text
An HTML attribute name.