{-| We present an algorithm for de-sugaring distributed affixes (/distfixes/) in a rose-like data structure. Distfixes are also known as /mixfixes/, but I chose /dist-/ because the parts of the affix are distributed in-order through the root, rather than mixed in (out-of-order connotation) with the root. Now then, let's actually describe a distfix in detail: By /rose-like/ data structure, we mean any type @t@ such that when an element of @t@ can be 'unwrap'ped into a @[t]@, we can perform rewrites according to our distfix algorithm and 'rewrap' the result. If a particular element cannot be 'unwrap'ped, then it will be left alone during rewriting. Of course, this library was meant to operate on 'Hexprs' and 'Quasihexprs', but it could just as well work on a plain list or rose, as well as anything else you're willing to mangle into shape. A distributed affix consists of a number of alternating /keywords/ and /slots/. While keywords should match exactly one leaf node, slots can consume multiple nodes (leaves or branches) during a detection. If we denote slots by underscores and keywords by some reasonable programming language identifier (w/o underscores), then some representative distfix examples might be @_+_@, @_?_:_@, @_!@, @if_then_else_@, and @while_do_end@. Using the algorithm requires categorizing the input distfixes in several dimensions: /topology/, /associativity/, /priority/, and /precedence/. Only precedence need by specified by the user (it is extrinsic to any distfix), the rest are either specified in or calculated from the distfix at hand. We discuss these properties below: Slots in a distfix are always separated by keywords, but they may also be a leading and/or trailing keyword in a distfix. The presence or absence of certain keywords is the /topology/ of a distfix, and this affects the possibilities of its /associativity/. There are four options: * /Closed/: preceded and followed by keywords (e.g. @begin_end@) * /Half-open Left/: only followed by a keyword (e.g. @_!@) * /Half-open Right/: only preceded by a keyword (e.g. @if_then_else_@) * /Open/: neither preceded nor followed by a keyword (e.g. @_+_@) As usual, there are three /associativities/: /left-/, /right-/, and /non-associative/. Open distfixes can take any of these three. Closed distfixes have no associativity. Half-open left distfixes are always left associative, and half-open right are always right associative. Operators are divided into /precedence/ levels as normal, but there are no limits on the number of precedence levels available for use. In the distfix table, groups of distfixes of the same precedence are sorted in descending order. When given a list of expressions (the contents of an 'unwrap'ped node) and a distfix, the distfix may be /detected/ within the list. When multiple distfixes in a single precedence level are detected at once, an attempt is made to /select/ exactly one of the detected distfixes using a /priority scheme/ calculated from the properties of the distfixes in question. Provided that one distfix has a higher priority than all the other detected distfixes, the highest priority distfix binds least tightly (and is therefore selected first). The rules for calculating priority are these: * If both distfixes have the same associativity (left- or right-, but not non-associative), the one with the \"most significant\" keyword \"earliest\" has priority: for left-associative, most significant means first and earliest means leftmost; for right-associative, most significant means last, earliest means rightmost. If its a still a tie, then the one with the most keywords has priority. * If both distfixes are closed, then they must be non-overlapping, or one must contain the other. It doesn't really matter which has higher priority if they don't overlap (as it happens, we've chosen leftmost for now). If one nests within the other, the outer has priority. If they overlap exactly, then the one with the most keywords has priority. * Other pairs of matches have no priority distinction. Given that a particular distfix is detected and selected for rewriting, we rewrite the list of terms by /extracting/ the distfix from its slots. Specifically, we take the detected elements and run them through the distfix's /rewriter/ to produce some single element. We then place the rewritten element at the front of the node, followed by each (filled) slot in order and 'rewrap'ped in its own node. The re-written list is finally 'rewrap'ped and placed back in its original context. Detections are made recursively. The details are unimportant except that this algorithm is applied at every branch in the structure /as made available/ by 'unwrap' and the recursion respects precedence and priority. Each branch is assumed to have been enclosed by parenthesis during parsing, and therefore 'unwrap'ping resets the precedence level. Note that rewriting only adds branches to the structure, never removes them, and so we can see distfixes as adding implicit parenthesis, which can be quite valuable as a conservative tool for increasing the signal-to-noise ratio in a programming language. Now for some technical notes: I'm not sure how detection and priority will work if the same keyword appears twice in the same distfix, so it's probably best to avoid that for now. Or work it out and tell me, whatever. Either someone will eventually need this, at which point we'll deal with it, or maybe I'll get bored, or maybe I just won't care enough relative to other problems. The two-typeclass system might seem a bit strange, but this is so I can avoid making the user involve ghc's @FlexibleInstances@ extension. So, give an instance for @'DistfixElement' SomeType@ and @'DistfixElement' a => 'DistfixStructure' ('Hexpr' a)@, with 'nodeMatch' simply unwrapping 'Leaf' and delegating to 'match'. -} module Language.Distfix ( -- * Data Structures Distfix(..) , Shape(..) , DistfixTable -- * Classes , DistfixStructure(..) , DistfixElement(..) -- * User Space , runDistfix , DistfixError(..) ) where import Data.Ord import Data.List import Data.Maybe import Data.Either import Control.Applicative import Control.Monad ------ Types ------ {-| These data structures can be de-structured in a rose-like fashion. See the module description for detail on the meaning of \"rose-like\". There is one law: [Inverse] @maybe node (\(xs, rewrap) -> rewrap xs) (unwrap node) === node@ In other words, if you can unwrap a node, then rewrapping will perform the inverse. -} class DistfixStructure f where {-| Unpack a branch node into a list of that branch's children and a rewrapping function. -} unwrap :: f -> Maybe ([f], [f] -> f) {-| Workaround so I can give an instance of Show (DistfixError a). -} defaultRewrap :: [f] -> f {-| Whereas 'match' operates on elements of the structure, 'nodeMatch' is really just boilerplate that extracts an element and calls 'match' on it. For example, we might write @ instance 'DistfixDetect' a => 'DistfixStructure' ('Hexpr' a) nodeMatch ('Leaf' x) ('Leaf' y) = match x y nodeMatch _ _ = False @ Very probably, it would not make sense to allow a non-leaf node to match anything (by implication, disallowing non-leaf keywords). -} nodeMatch :: f -> f -> Bool {-| This class is used for matching instead of 'Eq' so that certain components of the data might be ignored. For example, if @a = (SourcePos, b)@ then the @SourcePos@ should clearly be ignored during matching. -} class DistfixElement a where {-| Whether the two elements are equal with respect to matching a keyword. -} match :: a -> a -> Bool {-| A distfix consists of 1) a rewriter, the results of which precede the slots when extracting, 2) a topology and associativity, which is actually merged into a single datatype 'Shape' because the choice of associativity is not independent of topology, and 3) a non-empty list of keywords, each implicitly separated by a slot. In case a distfix has a closed topology, its list of keywords must actually be at least two elements long (one for the open keyword, and one for the close keyword). For more detail on these components, see the module documentation. -} data Distfix a = Distfix ([a] -> a) Shape [a] {-| Information on both topology and associativity. The two properties are merged into one datatype because choice of one limits choice of the other. The constructors should make the possibilities clear enough, but the module documentation might better present the reasoning involved. -} data Shape = Closed | HalfOpenRight | HalfOpenLeft | OpenRight | OpenLeft | OpenNon deriving (Eq, Show) {-| A list, in descending order of precedence (ascending of binding tightness) of groups of Distfixes. How tightly distfixes within a group bind relative to one another is determined by priority (see the module description). Although ambiguous grammars are accepted, it might be best to avoid forcing the user to make lots of priority calculations just to determine if they need to insert disambiguating parenthesis. -} type DistfixTable a = [[Distfix a]] newtype Detection a = Detection { unMatch :: (Distfix a, [a], [a], [[a]], [a]) } data MatchResult a = NoMatch | OneMatch (Detection a) | Ambiguous [Detection a] type DistfixResult a = DistfixResult' a a -- I only threw @DistfixResult'@ in here so I can make a monad. @DistfixResult@ is the more important one newtype DistfixResult' e a = Result { unResult :: Either (DistfixError e) a } {-| Report reasons for error in recognizing distfixes. There are two causes of error: [Ambiguity] When there is no single detection that has higher precedence or priority within a set of detections made in a node, this is an ambiguous parse. Note that ambiguous grammars are allowed in this scheme, but should this ambiguity manifest itself in an input, that input is not recognized. Really, this is pretty spiffy: distfixes admit specification fairly near to an arbitrary context-free grammar, but the algorithm will excise ambiguity only where it needs to, completely side-stepping the problem of whether a given grammar is ambiguous. [Leftovers] Once we've detected all the keywords possible in a node, we need to ensure there are no leftover keywords. If there were, this would probably indicate a user forgetting a keyword. For example, suppose @[|_|]@ were a distfix then @[| a ]@ would obtain a LeftoverErr. There's some fuzziness between 'AmbiguousErr' and 'LeftoverErr'. To illustrate, suppose we have @_<_@ and @_<=_@ but not @_<=_<_@ as a distfix, then both @a < b < c@ and @a <= b < c@ will be errors. The first will result in leftovers, and the second in ambiguity. It would make sense if they were both 'AmbiguousErr', but doing so under the current structure would sacrifice some efficiency (and possibly complicate matters). Still, at least everything that /should/ be an error /is/ an error. -} data DistfixError a = AmbiguousErr [(Distfix a, [a], [a], [[a]], [a])] | LeftoverErr [a] ------ Instances ------ instance (Show a, DistfixStructure a) => Show (DistfixError a) where show (AmbiguousErr matches) = headText ++ concatMap makeLine matches where headText = "Ambiguous distfix parse. Could have been one of:" makeLine = ("\n\t"++) . show . right . extract (return . defaultRewrap) defaultRewrap . Detection right (Result (Right x)) = x show (LeftoverErr [k]) = "Leftover keyword: " ++ show k show (LeftoverErr ks) = "Leftover keywords:" ++ concatMap ((' ':) . show) ks ------ Main Algorithm ------ {-| Given a table of distfixes and some input structure, apply the distfix detection/extraction algorithm. The algorithm may fail with a 'DistfixError'. The module description explains successful results in more detail. -} runDistfix :: DistfixStructure a => DistfixTable a -> a -> Either (DistfixError a) a runDistfix table x = case unwrap x of Nothing -> return x Just (xs, rewrap) -> mapM (runDistfix table) xs >>= unResult . (impl rewrap table) where impl rewrap [] xs = findLeftovers rewrap allKeywords xs impl rewrap table'@(row:rows) xs = case select row xs of NoMatch -> impl rewrap rows xs OneMatch op -> extract (impl rewrap table') rewrap op Ambiguous ops -> Result . Left . AmbiguousErr $ fmap unMatch ops allKeywords = nubBy nodeMatch . (concatMap . concatMap) (\(Distfix _ _ ks) -> ks) $ table {-| It's a pretty weird mutual-recursion thing going on between runDistfix.impl and extract. See, we obviously have to recurse on the inner nodes, but then we also need to recurse on the reconstructed node, in case of nodes like `a + if p then conseq else alt` I'm basically passing in `recurse` as a specialized delimited continuation. There's another function that uses extract, so I couldn't just put extract in the closure with impl. I'm pretty sure it's necessary to pass the recursion in anyway (but I can't remember why). -} extract :: DistfixStructure a => ([a] -> DistfixResult a) -> ([a] -> a) -> Detection a -> DistfixResult a extract recurse rewrap (Detection (Distfix rewrite _ _, found, before, inside, after)) = do inside' <- rewrap . (rewrite found:) <$> mapM recurse inside recurse $ before ++ [inside'] ++ after ------ Selection ------ {-| Given a bunch of distfixes (at the same precedence level), try to find an unambiguous distfix parse within a list. -} select :: DistfixStructure a => [Distfix a] -> [a] -> MatchResult a select ops xs = impl detectAll [] where detectAll = catMaybes $ map (detect xs) ops impl [] eqSet = case eqSet of [] -> NoMatch [x] -> OneMatch x xs -> Ambiguous xs impl (x:xs) eqSet = if is Lower then impl xs eqSet else if is Higher then impl xs [x] else impl xs (x:eqSet) where is = (`elem` map (decidePriority x) eqSet) {-| Given two detections, give the relative priority of the first to the second. -} decidePriority :: Detection a -> Detection a -> Priority decidePriority a@(Detection (Distfix _ topA ksA, _, bA, iA, aA)) b@(Detection (Distfix _ topB ksB, _, bB, iB, aB)) = case (topA, topB) of (OpenLeft, OpenLeft) -> decideLeft (OpenLeft, HalfOpenLeft) -> decideLeft (HalfOpenLeft, OpenLeft) -> decideLeft (HalfOpenLeft, HalfOpenLeft) -> decideLeft (OpenRight, OpenRight) -> decideRight (OpenRight, HalfOpenRight) -> decideRight (HalfOpenRight, OpenRight) -> decideRight (HalfOpenRight, HalfOpenRight) -> decideRight (Closed, Closed) -> decideClosed (Closed, _) -> Higher (_, Closed) -> Lower _ -> Same where decideRight = leftmost `joinPriority` mostKeywords where leftmost = fromOrd . negOrd $ comparing leftmostKeyword a b decideLeft = rightmost `joinPriority` mostKeywords where rightmost = fromOrd $ comparing rightmostKeyword a b decideClosed = leftmostNoOverlap `joinPriority` outermost `joinPriority` (if exactOverlap then mostKeywords else Same) where leftmostNoOverlap = if aR < bL then Higher else if bR < aL then Lower else Same outermost = case (compare aL bL, compare aR bR) of (LT, GT) -> Higher -- `a b b a` (GT, LT) -> Lower -- `b a a b` (LT, EQ) -> Higher -- `a b ab` (GT, EQ) -> Lower -- `b a ab` (EQ, LT) -> Lower -- `ab a b` (EQ, GT) -> Higher -- `ab b a` _ -> Same exactOverlap = aL == bL && aR == bR aL = leftmostKeyword a aR = rightmostKeyword a bL = leftmostKeyword b bR = rightmostKeyword b mostKeywords = fromOrd $ comparing impl a b where impl (Detection (Distfix _ _ ks, _, _, _, _)) = length ks -- index of leftmost keyword in the original node leftmostKeyword (Detection (Distfix _ OpenRight _, _, _, inside, _)) = length (head inside) leftmostKeyword (Detection (Distfix _ HalfOpenRight _, _, before, _, _)) = length before leftmostKeyword (Detection (Distfix _ Closed _, _, before, _, _)) = length before -- index of rightmost keyword in the original node rightmostKeyword (Detection (Distfix _ OpenLeft _, _, _, inside, _)) = sum (map length $ init inside) + (length (init inside) - 1) rightmostKeyword (Detection (Distfix _ HalfOpenLeft _, _, _, inside, _)) = sum (map length inside) + (length inside - 1) rightmostKeyword (Detection (Distfix _ Closed _, _, before, inside, _)) = length before + sum (map length inside) + length inside ------ Detection ------ {-| Once all possible detections have been found in a node, use this to repack. -} findLeftovers :: DistfixStructure a => ([a] -> a) -> [a] -> [a] -> DistfixResult a findLeftovers rewrap ks xs = case filter (\x -> nodeMatch x `any` ks) xs of [] -> Result . Right . rewrap $ xs errs -> Result . Left $ LeftoverErr errs {-| FIXME MAYBE This guy ignores the possibility of several matches of the same keyword, which may lead to weird error messages? Not sure, but if weird error messages don't haunt us, this is more efficient by a fair margin. -} detect :: DistfixStructure a => [a] -> Distfix a -> Maybe (Detection a) detect xs fix@(Distfix _ topology ks) = do when (length ks == 0) $ error "distfixes must have at least one keyword" (found, before, inside, after) <- case topology of Closed -> do when (length ks < 2) $ error "closed distfixes must have at least two keywords" (as, kw1, ds) <- findKey (head ks) xs (cs, kw, bs) <- revFindKey (last ks) ds (kws, res) <- detectBody (init . tail $ ks) cs Just ([kw1]++kws++[kw], as, res, bs) HalfOpenRight -> do (as, kw1, bs) <- findKey (head ks) xs (kws, res) <- detectBody (tail ks) bs Just ([kw1]++kws, as, res, []) HalfOpenLeft -> do (as, kw, bs) <- revFindKey (last ks) xs (kws, res) <- revDetectBody (init ks) as Just (kws++[kw], [], res, bs) OpenRight -> do (as, kw1, bs) <- findKey (head ks) xs -- find the first one, so the following ones get wrapped in implicit parens (kws, res) <- detectBody (tail ks) bs Just ([kw1]++kws, [], as:res, []) OpenLeft -> do (as, kw, bs) <- revFindKey (last ks) xs -- find the last one, so the preceding ones get wrapped in implicit parens (kws, res) <- revDetectBody (init ks) as Just (kws++[kw], [], res++[bs], []) OpenNon -> do (as, kw1, bs) <- findKey (head ks) xs (kws, res) <- detectBody (tail ks) bs if isJust $ detect (last res) fix then Nothing else Just ([kw1]++kws, [], as:res, []) if null `any` inside then Nothing else Just $ Detection (fix, found, before, inside, after) {-| recognize keyword-slot pairs left-to-right, so use as a continuation after stripping away leading/trailing keywords -} detectBody :: DistfixStructure a => [a] -> [a] -> Maybe ([a], [[a]]) detectBody ks xs = impl ks xs [] [] where impl [] xs kws xss = Just (reverse kws, reverse (xs:xss)) impl (k:ks) xs kws xss = do (as, kw, bs) <- findKey k xs impl ks bs (kw:kws) (as:xss) {-| as detect body, but right-to-left, for left-associative things -} revDetectBody :: DistfixStructure a => [a] -> [a] -> Maybe ([a], [[a]]) revDetectBody ks xs = do (kws, xss) <- impl (reverse ks) (reverse xs) [] [] return (kws, map reverse xss) where impl [] xs kws xss = Just (reverse kws, reverse (xs:xss)) impl (k:ks) xs kws xss = do (bs, kw, as) <- findKey k xs impl ks bs (kw:kws) (as:xss) {-| Get the parts of a list (before, after) the given keyword. Start from the left. -} findKey :: DistfixStructure a => a -> [a] -> Maybe ([a], a, [a]) findKey kw xs = case nodeMatch kw `break` xs of res@(_, []) -> Nothing (before, (k:after)) -> Just (before, k, after) {-| As findKey, but start from right. -} revFindKey :: DistfixStructure a => a -> [a] -> Maybe ([a], a, [a]) revFindKey kw xs = do (b,k,a) <- findKey kw (reverse xs) return (reverse a, k, reverse b) ------ Helpers ------ {-| if the first way of determining priority works, take it, otherwise try the second way -} joinPriority :: Priority -> Priority -> Priority joinPriority Same y = y joinPriority x _ = x data Priority = Higher | Lower | Same deriving (Eq) fromOrd LT = Lower fromOrd EQ = Same fromOrd GT = Higher negOrd :: Ordering -> Ordering negOrd LT = GT negOrd EQ = EQ negOrd GT = LT instance Functor (DistfixResult' e) where fmap = liftM instance Applicative (DistfixResult' e) where pure = return (<*>) = ap instance Monad (DistfixResult' e) where return = Result . Right (Result x) >>= k = Result (x >>= unResult . k) instance (Show a) => Show (Distfix a) where show (Distfix _ shape x) = "Distfix " ++ show shape ++ " " ++ show x