purescript-0.15.5: PureScript Programming Language Compiler
Safe HaskellSafe-Inferred
LanguageHaskell2010

Language.PureScript.CST.Layout

Description

## High-Level Summary

This section provides a high-level summary of this file. For those who know more about compiler-development, the below explanation is likely enough. For everyone else, see the next section.

The parser itself is unaware of indentation, and instead only parses explicit delimiters which are inserted by this layout algorithm (much like Haskell). This is convenient because the actual grammar can be specified apart from the indentation rules. Haskell has a few problematic productions which make it impossible to implement a purely lexical layout algorithm, so it also has an additional (and somewhat contentious) parser error side condition. PureScript does not have these problematic productions (particularly foo, bar :: SomeType syntax in declarations), but it does have a few gotchas of it's own. The algorithm is "non-trivial" to say the least, but it is implemented as a purely lexical delimiter parser on a token-by-token basis, which is highly convenient, since it can be replicated in any language or toolchain. There is likely room to simplify it, but there are some seemingly innocuous things that complicate it.

Naked commas (case, patterns, guards, fundeps) are a constant source of complexity, and indeed too much of this is what prevents Haskell from having such an algorithm. Unquoted properties for layout keywords introduce a domino effect of complexity since we have to mask and unmask any usage of . (also in foralls!) or labels in record literals.

## Detailed Summary

### The Problem

The parser itself is unaware of indentation or other such layout concerns. Rather than dealing with it explicitly, the parser and its grammar rules are only aware of normal tokens (e.g. TokLowerName) and three special zero-width tokens, TokLayoutStart, TokLayoutSep, and TokLayoutEnd. This is convenient because the actual grammar can be specified apart from the indentation rules and other such layout concerns.

For a simple example, the parser parses all three examples of the code below using the exact same grammar rules for the let keyword despite each example using different indentations levels:

-- Example 1
let foo = 5
    x = 2 in foo

-- Example 2
let
  bar = 5
  y = 2
in bar

-- Example 3
let        baz
                 =
             5
           z= 2 in baz

Each block of code might appear to the parser as a stream of the following source tokens where the { sequence represents TokLayoutStart, the ; sequence represents TokLayoutSep, and the } sequence represents TokLayoutEnd: - let {foo = 5;x = 2} in foo - let {bar = 5;y = 2} in bar - let {baz = 5;z = 2} in baz

For a more complex example, consider commas:

case one, { twoA, twoB }, [ three1
   , three2
   ,   do
     { three3, three4 } <- case arg1, arg2 of
        Nothing, _ -> { three3: 1, three4: 2 }
        Just _, Nothing -> { three3: 2, three4: 3 }
        _, _ -> { three3: 3, three4: 4 }
     pure $ three3 + three4
   ] of

Which of the above 13 commas function as the separaters between the case binders (e.g. one) in the outermost case ... of context?

### The Solution

The parser doesn't have to care about layout concerns (e.g. indentation or what starts and ends a context, such as a case binder) because the lexer solves that problem instead.

So, how does the lexer solve this problem? It follows this general algorithm: 1. Lex the source code text into an initial stream of SourceTokens that do not have any of the three special tokens mentioned previously. 2. On a token-by-token basis, determine whether the lexer should 1. insert one of the three special tokens, 2. modify the current context (e.g. are we within a case binder? Are we in a record expression?)

Step 2 is handled via insertLayout and is essentially a state machine. The layout delimiters, (e.g. LytCase, LytBrace, LytProperty, and LytOf in the next section's example) either stop certain "rules" from applying or ensure that certain "rules" now apply. By "rules", we mean whether and where one of the three special tokens are added. The comments in the source code for the insertLayout algorithm call pushing these delimiters onto the stack "masking" and popping them off as "unmasking". Seeing when a layout delimiter is pushed and popped are the keys to understanding this algorithm.

### Walking Through an Example

Before showing an example, let's remember a few things. 1. The TokLowerName "case" token (i.e. a "case" keyword) indicates the start of a case ... of context. That context includes case binders (like the example shown previously) that can get quite complex. When encountered, we may need to insert one or more of the three special tokens here until we encounter the terminating TokLowerName "of" token that signifies its end. 2. "case" and "of" can also appear as a record field's name. In such a context, they would not start or end a case ... of block.

Given the below source code...

case { case: "foo", of: "bar" } of

the lexer would go through something like the following states: 1. Encountered TokLowerName "case". Update current context to "within a case of expression" by pushing the LytCase delimiter onto the layout delimiter stack. Insert the case token into the stream of source tokens. 2. Encountered TokLeftBrace. Update current context to "within a record expression" by pushing the LytBrace delimiter. Since we expect a field name to be the next token we see, which may include a reserved keyword, update the current context again to "expecting a field name" by pushing the LytProperty. delimiter. Insert the { token into the stream of source tokens. 3. Encountered TokLowerName "case". Check the current context. Since it's a LytProperty, this is a field name and we shouldn't assume that the next few tokens will be case binders. However, since this might be a record with no more fields, update the current context back to "within a record expression" by popping the LytProperty off the layout delimiter stack. Insert the case token 4. Encountered TokColon. Insert the : token 5. Encountered TokLowerName "foo". Insert the foo token. 6. Encountered TokComma. Check the current context. Since it's a LytBrace, we're in a record expression and there is another field. Update the current context by pushing LytProperty as we expect a field name again. 7. Encountered TokLowerName "of". Check the current context. Since it's a LytProperty, this is a field name rather than the end of a case binder. Thus, we don't expect the next tokens to be the body in a case ... of body expression. However, since this might be a record with no more fields, update the current context back to "within a record expression" by popping the LytProperty off the stack. Insert the of token. 8. Encountered TokRightBrace. Check the current context. Since it's a LytBrace, this is the end of a record expression. Update the current context to "within a case of expression" by popping the LytBrace off the stack. Insert the } token. 9. Encountered TokLowername "of". Check the current context. Since it's a LytCase, this is the end of a case ... of expression and the body will follow. Update the current context to "body of a case of expression" by pushing LytOf onto the layout stack. Insert the of token into the stream of tokens.

Documentation