follow-0.1.0.0: Haskell library to follow content published on any subject.

Safe Haskell: None
Language: Haskell2010

Follow.Fetchers.WebScraping

Description

This module defines a fetcher strategy that generates entries by scraping the content served at an HTML URI.

A Selector must be given to specify where the information for each entry field is taken from.

Be aware that scraping an HTML page offers very few consistency guarantees. Depending on the page structure and the selectors you give, you could end up with 5 URIs, 4 titles and 6 descriptions. Keep in mind that the URIs are the leading and limiting field: the number of URIs determines the number of entries, so in the previous scenario one Nothing title would be added and one description would be discarded.
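The URI-driven alignment described above can be sketched in plain Haskell. This is only an illustration, not the library's implementation: `alignTo`, `aligned` and the sample values are hypothetical names, and the sketch assumes that missing values are padded at the end of the list.

```haskell
-- Hypothetical helper, not part of the follow API: force a list of scraped
-- values to exactly n items. Assumes missing values are padded at the end.
alignTo :: Int -> [a] -> [Maybe a]
alignTo n xs = take n (map Just xs ++ repeat Nothing)

-- The scenario from the text: 5 URIs, 4 titles, 6 descriptions.
uris, titles, descriptions :: [String]
uris         = ["/1", "/2", "/3", "/4", "/5"]
titles       = ["t1", "t2", "t3", "t4"]
descriptions = ["d1", "d2", "d3", "d4", "d5", "d6"]

-- One (uri, title, description) triple per URI.
aligned :: [(String, Maybe String, Maybe String)]
aligned = zip3 uris (alignTo n titles) (alignTo n descriptions)
  where
    n = length uris
```

Here `aligned` has five elements; the last one is `("/5", Nothing, Just "d5")`, and `"d6"` is dropped.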

Here is an example:

{-# LANGUAGE OverloadedStrings #-}

import Follow
import Follow.Fetchers.WebScraping

-- Where each entry field is scraped from; Nothing means no selector
-- is given for that field.
selector :: Selector
selector = Selector {
    selURI = Just $ Attr ".title a" "href"
  , selGUID = Just $ Attr ".title a" "href"
  , selTitle = Just $ InnerText ".title a"
  , selDescription = Just $ InnerText ".description"
  , selAuthor = Just $ InnerText ".author"
  , selPublishDate = Nothing
}

-- fetch takes the URL and the selector as two separate arguments.
result :: IO [Entry]
result = fetch "http://an_url.com" selector

Documentation

fetch :: (MonadThrow m, MonadIO m, MonadHttp m) => ByteString -> Selector -> Fetched m

Fetches entries from the given URL using the specified selectors.

data Selector

Data type with the selectors to use when scraping each Entry item.

Instances

Eq Selector
Defined in Follow.Fetchers.WebScraping.Internal

Show Selector
Defined in Follow.Fetchers.WebScraping.Internal

FromJSON Selector
  uri: # See SelectorItem instance
  title: null
  description: null
  guid: null
  author: null
  publish_date: null

Defined in Follow.Parser

data SelectorItem

Selector to use when scraping an Entry item.

Constructors

InnerText CSSSelector

This selector takes the inner text that is an immediate descendant of the tag matched by the given CSS selector.

Attr CSSSelector HTMLAttribute

This selector takes the value of the given attribute in the tag matched by the given CSS selector.
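The difference between the two constructors can be illustrated with a toy model. This is a sketch under stated assumptions, not how the library scrapes pages: `Node`, `Element`, `Extraction` and `extract` are hypothetical names, CSS matching is left out entirely (the element is assumed to be already matched), and "immediately descendant" is read as "text nodes that are direct children".

```haskell
-- Toy document model (hypothetical, not the library's types).
data Node
  = TextNode String
  | ElemNode Element

data Element = Element
  { elemAttrs    :: [(String, String)]
  , elemChildren :: [Node]
  }

-- Mirrors InnerText and Attr without the CSS selector argument:
-- the element is assumed to have been matched already.
data Extraction
  = TakeInnerText
  | TakeAttr String

-- Extract a value from an already matched element.
extract :: Extraction -> Element -> Maybe String
extract TakeInnerText el =
  -- Only text nodes that are direct children are considered.
  case [t | TextNode t <- elemChildren el] of
    [] -> Nothing
    ts -> Just (concat ts)
extract (TakeAttr name) el = lookup name (elemAttrs el)

-- Models <a href="/post/1">First post<span> (new)</span></a>.
link :: Element
link = Element
  [("href", "/post/1")]
  [TextNode "First post", ElemNode (Element [] [TextNode " (new)"])]
```

In this model, `extract TakeInnerText link` yields `Just "First post"` (the nested span text is ignored), `extract (TakeAttr "href") link` yields `Just "/post/1"`, and a missing attribute yields `Nothing`.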

Instances

Eq SelectorItem
Defined in Follow.Fetchers.WebScraping.Internal

Show SelectorItem
Defined in Follow.Fetchers.WebScraping.Internal

FromJSON SelectorItem
  type: text
  options:
    css: .selector

or

  type: attr
  options:
    css: .link
    name: href

Defined in Follow.Parser

type CSSSelector = Text

A CSS2 selector.

type HTMLAttribute = Text

An HTML attribute name.