-- | Zenacy HTML is an HTML parsing and processing library that implements the
-- WHATWG HTML parsing standard.  The standard is described as a state machine,
-- which this library implements exactly as spelled out, including all the
-- error handling, recovery, and conformance checks that make it robust when
-- handling HTML pulled from the web.  In addition to parsing, the library
-- provides many processing features to help extract information from web
-- pages or rewrite them and render the modified results.
module Zenacy.HTML
  (
  -- * Introduction
  -- $intro

  -- * Parsing
  -- $parse

  -- * Rewriting
  -- $rewrite

  -- * Extraction
  -- $extract

  -- * Queries
  -- $query

  -- * Samples
  -- $samples

  -- * Origin
  -- $history

  module X
  ) where

import Zenacy.HTML.Internal.HTML as X
import Zenacy.HTML.Internal.Filter as X
import Zenacy.HTML.Internal.Image as X
import Zenacy.HTML.Internal.Oper as X
import Zenacy.HTML.Internal.Query as X
import Zenacy.HTML.Internal.Render as X
import Zenacy.HTML.Internal.Zip as X

-- $intro
--
-- The Zenacy HTML parser is an implementation of the HTML parsing standard
-- defined by the WHATWG.
--
-- https://html.spec.whatwg.org/multipage/parsing.html
--
-- The standard defines a parsing state machine, so it is very prescriptive
-- about how HTML is handled, including many edge cases and error recovery.
-- This library aims to follow the standard closely so that the code can be
-- matched back to the standard and future updates remain straightforward.
--
-- One of the main uses of an HTML parser is extracting information from
-- the web.  Having a parser that can handle all the nuances of poorly
-- formatted HTML helps to make this extraction as robust as possible.
-- This was a key motivation in deciding to implement a parser in this fashion.
-- Additionally, the standard describes the algorithms needed to produce the
-- correct document structure.  Applications that are sensitive to the
-- document structure, such as extracting and rewriting large portions of
-- a web page, may benefit from Zenacy HTML.
--
-- The library provides a wide variety of features including:
--
-- * A fully standards-compliant HTML parser
-- * HTML fragment parsing
-- * Document rendering
-- * A zipper type for document traversal
-- * An iterator type for document walking
-- * Various functions for processing aspects of HTML
-- * Lightweight queries for rewriting
--
-- $parse
--
-- The library is designed to be imported unqualified.
--
-- > import Zenacy.HTML
--
-- The `htmlParseEasy` function can be used to parse an HTML document string
-- and return the document model.
--
-- > htmlParseEasy "<div>HelloWorld</div>"
--
-- Note that some of the missing elements were automatically added to
-- the document structure as required by the standard.
--
-- > HTMLDocument ""
-- >   [ HTMLElement "html" HTMLNamespaceHTML []
-- >     [ HTMLElement "head" HTMLNamespaceHTML [] []
-- >     , HTMLElement "body" HTMLNamespaceHTML []
-- >       [ HTMLElement "div" HTMLNamespaceHTML []
-- >         [ HTMLText "HelloWorld" ] ] ] ]
--
-- The parsed result can also be rendered using `htmlRender`.
--
-- > htmlRender $ htmlParseEasy "<div>HelloWorld</div>"
--
-- The resulting rendered document appears like so.
--
-- > <html><head></head><body><div>HelloWorld</div></body></html>
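--
-- Since the parser performs the error recovery described by the standard,
-- the same two functions can be composed to normalize markup.  The following
-- is a minimal sketch rather than part of the library: the name @normalize@
-- is hypothetical, and @Text@ is assumed to be imported from "Data.Text".
--
-- > normalize :: Text -> Text
-- > normalize = htmlRender . htmlParseEasy
-- >
-- > normalize "<p>one<p>two"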
--
-- $rewrite
--
-- This example illustrates a function that converts span elements to divs.
-- 
-- > rewrite :: Text -> Text
-- > rewrite = htmlRender . htmlMapElem f . fromJust . htmlDocHtml . htmlParseEasy
-- >   where
-- >     f x
-- >       | htmlElemHasName "span" x = htmlElemRename "div" x
-- >       | otherwise = x
-- >
-- > rewrite "<span>Hello</span><span>World</span>"
--
-- Running the above gives the modified document.
--
-- > <html><head></head><body><div>Hello</div><div>World</div></body></html>
--
-- $extract
--
-- The next example shows one way to find all the hyperlinks in a document.
-- This solution recurses over the document elements while ignoring fragments
-- and templates.  It relies on the @LambdaCase@ extension.
--
-- > extract :: Text -> [Text]
-- > extract = go . htmlParseEasy
-- >   where
-- >     go = \case
-- >       HTMLDocument _ c ->
-- >         concatMap go c
-- >       e@(HTMLElement "a" _ _ c) ->
-- >         case htmlElemAttrFind (htmlAttrHasName "href") e of
-- >           Just (HTMLAttr _ v _) ->
-- >             v : concatMap go c
-- >           Nothing ->
-- >             concatMap go c
-- >       HTMLElement _ _ _ c ->
-- >         concatMap go c
-- >       _otherwise ->
-- >         []
-- >
-- > extract "<a href=\"https://example1.com\"></a><a href=\"https://example2.com\"></a>"
--
-- The extract function will give the following list.
--
-- > [ "https://example1.com"
-- > , "https://example2.com"
-- > ]
--
-- $query
--
-- The library includes a basic query facility implemented as a thin wrapper
-- around an `HTMLZipper`.  Queries match patterns in HTML structures and can
-- be used to extract information or update documents.  As a first example,
-- consider the following HTML.
--
-- > <p>
-- >   <span id="x" class="y z"></span>
-- >   <br>
-- >   <a href="bbb">AAA</a>
-- >   <img>
-- > </p>
--
-- The HTML can be parsed as normal.  Note though the additional step of
-- whitespace removal, which is often important in documents that include
-- indentation such as above.
--
-- > fromJust . htmlSpaceRemove . fromJust . htmlDocBody . htmlParseEasy
--
-- Now a query function can be defined.  This function expects to be given
-- a @body@ element whose first child is a @p@ element whose first child
-- has an id of @x@ whose second sibling is an anchor element.  If all of
-- those conditions are met, the text content of the anchor is returned.
--
-- > query :: HTMLNode -> Maybe Text
-- > query = htmlQueryExec $ do
-- >   htmlQueryName "body"
-- >   htmlQueryFirst
-- >   htmlQueryName "p"
-- >   htmlQueryFirst
-- >   htmlQueryId "x"
-- >   htmlQueryNext
-- >   htmlQueryNext
-- >   htmlQueryName "a"
-- >   a <- htmlQueryNode
-- >   htmlQuerySucc $
-- >     fromMaybe "" $ htmlElemText a
--
-- Running the query on the parsed document will give the result.
--
-- > Just "AAA"
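--
-- The parsing step above uses @fromJust@, which raises an error if either
-- step returns @Nothing@.  As a sketch of a total alternative, the steps can
-- be chained in the @Maybe@ monad.  The name @runQuery@ is hypothetical, and
-- the types of @htmlDocBody@ and @htmlSpaceRemove@ are assumed from their use
-- with @fromJust@ above.
--
-- > runQuery :: Text -> Maybe Text
-- > runQuery t = htmlDocBody (htmlParseEasy t) >>= htmlSpaceRemove >>= query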
--
-- Queries can also be used to modify documents.  In the next example, let's
-- say we would like to find any @img@ that is the only content in a @div@ and
-- replace the @div@ with a link.  The document could look as follows.
-- 
-- > <section><div><img src="aaa"></div></section>
-- > <section><div><img src="bbb"></div></section>
-- > <section><div><img src="ccc"></div></section>
--
-- A query function can be defined to match the desired pattern and return the
-- modified element.
--
-- > query2 :: HTMLNode -> HTMLNode
-- > query2 = htmlQueryTry $ do
-- >   htmlQueryName "div"
-- >   htmlQueryOnly "img"
-- >   a <- htmlQueryNode
-- >   let Just b = htmlElemGetAttr "src" a
-- >   htmlQuerySucc $
-- >     htmlElem "a" [ htmlAttr "href" b ]
-- >       [ htmlText b ]
--
-- The query can then be applied to the entire document using `htmlMapElem`.
--
-- > htmlMapElem query2
--
-- Rendering the mapped query will give the updated content.
--
-- > <section><a href="aaa">aaa</a></section>
-- > <section><a href="bbb">bbb</a></section>
-- > <section><a href="ccc">ccc</a></section>
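--
-- Putting the steps together, a minimal sketch of the full pipeline might
-- look as follows.  The name @linkImages@ is hypothetical, @fromJust@ is
-- assumed to be imported from "Data.Maybe", and starting from the body
-- element is an assumption modeled on the earlier examples.
--
-- > linkImages :: Text -> Text
-- > linkImages = htmlRender . htmlMapElem query2 . fromJust . htmlDocBody . htmlParseEasy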
--
-- $samples
--
-- The unit tests include the above samples as well as many other example
-- usages of the library.
--
-- $history
--
-- Zenacy HTML was originally developed for Zenacy Reader Technologies LLC
-- starting around 2015 and used in a web reading SaaS for a few years.
-- The need to understand and handle the wide variety and subtleties of HTML
-- found on the web led to the development of a library that closely followed
-- the standard.  The library was tweaked and optimized a bit, and though
-- there is room for more improvements, the result worked quite well in
-- production (a lot of credit goes to the GHC team and Haskell community
-- for providing such great, fast functional programming tools).