-- | Zenacy HTML is an HTML parsing and processing library that implements the
-- WHATWG HTML parsing standard. The standard is described as a state machine
-- that this library implements exactly as spelled out including all the error
-- handling, recovery, and conformance checks that makes it robust in handling
-- any HTML pulled from the web. In addition to parsing, the library provides
-- many processing features to help extract information from web pages or
-- rewrite them and render the modified results.
module Zenacy.HTML
(
-- * Introduction
-- $intro
-- * Parsing
-- $parse
-- * Rewriting
-- $rewrite
-- * Extraction
-- $extract
-- * Queries
-- $query
-- * Samples
-- $samples
-- * Origin
-- $history
module X
) where
import Zenacy.HTML.Internal.HTML as X
import Zenacy.HTML.Internal.Filter as X
import Zenacy.HTML.Internal.Image as X
import Zenacy.HTML.Internal.Oper as X
import Zenacy.HTML.Internal.Query as X
import Zenacy.HTML.Internal.Render as X
import Zenacy.HTML.Internal.Zip as X
-- $intro
--
-- The Zenacy HTML parser is an implementation of the HTML parsing standard
-- defined by the WHATWG.
--
-- https://html.spec.whatwg.org/multipage/parsing.html
--
-- The standard defines a parsing state machine, so it is very prescriptive
-- on how HTML is handled including many edge cases and error recovery.
-- This library aims to follow the standard closely in such a way to match the
-- code back to the standard and make future updates straightforward.
--
-- One of the main uses an a HTML parser is for extracting information from
-- the web. Having a parser that can handle all the nuances of poorly
-- formatted HTML helps to make this extraction as robust as possible.
-- This was a key motivation in deciding to implement a parser in this fashion.
-- Additionally, the standard describes the algorithms needed to produce the
-- correct document structure. Applications that are sensitive to the
-- document structure, such as extracting and rewriting large portions of
-- a web page, may benefit from Zenacy HTML.
--
-- The library provides a wide variety of features including:
--
-- * A fully standard compliant HTML parser
-- * HTML Fragment parsing
-- * Document rendering
-- * A zipper type for document traversal
-- * An iterator type for document walking
-- * Various functions for processing aspects of HTML
-- * Lightweight queries for rewriting
--
-- $parse
--
-- The library is designed to be imported unqualified.
--
-- > import Zenacy.HTML
--
-- The `htmlParseEasy` function can be used to parse an HTML document string
-- and return the document model.
--
-- > htmlParseEasy "
HelloWorld
"
--
-- Note that some of the missing elements where automatically added to
-- the document structure as required by the standard.
--
-- > HTMLDocument ""
-- > [ HTMLElement "html" HTMLNamespaceHTML []
-- > [ HTMLElement "head" HTMLNamespaceHTML [] []
-- > , HTMLElement "body" HTMLNamespaceHTML []
-- > [ HTMLElement "div" HTMLNamespaceHTML []
-- > [ HTMLText "HelloWorld" ] ] ] ]
--
-- The parsed result can also be rendered using `htmlRender`.
--
-- > htmlRender $ htmlParseEasy "HelloWorld
"
--
-- The resulting rendered document appears like so.
--
-- > HelloWorld
--
-- $rewrite
--
-- This example illustrates a function that converts span elements to divs.
--
-- > rewrite :: Text -> Text
-- > rewrite = htmlRender . htmlMapElem f . fromJust . htmlDocHtml . htmlParseEasy
-- > where
-- > f x
-- > | htmlElemHasName "span" x = htmlElemRename "div" x
-- > | otherwise = x
-- >
-- > rewrite "HelloWorld"
--
-- Running the above gives the modified document.
--
-- > Hello
World
--
-- $extract
--
-- The next example shows one way to find all the hyperlinks in a document.
-- This solution recurses over the document elements while ignoring fragments
-- and templates.
--
-- > extract :: Text -> [Text]
-- > extract = go . htmlParseEasy
-- > where
-- > go = \case
-- > HTMLDocument n c ->
-- > concatMap go c
-- > e @ (HTMLElement "a" s a c) ->
-- > case htmlElemAttrFind (htmlAttrHasName "href") e of
-- > Just (HTMLAttr n v s) ->
-- > v : concatMap go c
-- > Nothing ->
-- > concatMap go c
-- > HTMLElement n s a c ->
-- > concatMap go c
-- > _otherwise ->
-- > []
-- >
-- > extract ""
--
-- The extract function will give the following list.
--
-- > [ "https://example1.com"
-- > , "https://example2.com"
-- > ]
--
-- $query
--
-- The library includes a basic query facility implemented as a thin wrapper
-- around an `HTMLZipper`. Queries match patterns in HTML structures and can
-- be used to extract information or update documents. As a first example,
-- consider the following HTML.
--
-- >
-- >
-- >
-- > AAA
-- >
-- >
--
-- The HTML can be parsed as normal. Note though the additional step of
-- whitespace removal, which is often important in documents that include
-- indentation such as above.
--
-- > fromJust . htmlSpaceRemove . fromJust . htmlDocBody . htmlParseEasy
--
-- Now a query function can be defined. This function expects to be given
-- a @body@ element whose first child is a @p@ element whose first child
-- has an id of @x@ whose second sibling is an anchor element. If all of
-- those conditions are met, the the text contents of the anchor is returned.
--
-- > query :: HTMLNode -> Maybe Text
-- > query = htmlQueryExec $ do
-- > htmlQueryName "body"
-- > htmlQueryFirst
-- > htmlQueryName "p"
-- > htmlQueryFirst
-- > htmlQueryId "x"
-- > htmlQueryNext
-- > htmlQueryNext
-- > htmlQueryName "a"
-- > a <- htmlQueryNode
-- > htmlQuerySucc $
-- > fromMaybe "" $ htmlElemText a
--
-- Running the query on the parsed document will give the result.
--
-- > Just "AAA"
--
-- Queries can also be used to modifiy documents. In the next example, let's
-- say we would like to find any @img@ that is the only content in a @div@ and
-- replace the @div@ with a link. The document could look as follows.
--
-- >
-- >
-- >
--
-- A query function can be defined to match the desired pattern and return the
-- modified element.
--
-- > query2 :: HTMLNode -> HTMLNode
-- > query2 = htmlQueryTry $ do
-- > htmlQueryName "div"
-- > htmlQueryOnly "img"
-- > a <- htmlQueryNode
-- > let Just b = htmlElemGetAttr "src" a
-- > htmlQuerySucc $
-- > htmlElem "a" [ htmlAttr "href" b ]
-- > [ htmlText b ]
--
-- The query can then be applied to the entire document using `htmlMapElem`.
--
-- > htmlMapElem query2
--
-- Rendering the mapped query with give the updated content.
--
-- >
-- >
-- >
--
-- $samples
--
-- The unit tests include the above samples as well as many other example
-- usages of the library.
--
-- $history
--
-- Zenacy HTML was originally developed for Zenacy Reader Technologies LLC
-- starting around 2015 and used in a web reading SaaS for a few years.
-- The need to understand and handle the wide variety and sublties of HTML
-- found on the web lead to the development of library that closely followed
-- the standard. The library was tweaked and optimized a bit and though
-- there is room for more improvements the result worked quite well in
-- production (a lot of credit goes to the GHC team and Haskell community
-- for providing such great, fast functional programming tools).