docs: advanced guides for conflict resolution and error handling (#191) #302
# Guide: Conflict Resolution

LR parsers, like the one generated by Alpaca, can encounter **conflicts** when the grammar is ambiguous. A conflict occurs when the parser has multiple valid actions for a given state and lookahead token.

Alpaca provides a declarative DSL to resolve these conflicts at compile time, ensuring your parser is deterministic and behaves as expected.
## 1. Types of Conflicts

### Shift/Reduce Conflict
Occurs when the parser can either **shift** (move the current token to the stack) or **reduce** (apply a grammar rule to symbols already on the stack).

**Example (Dangling Else):**
```scala
if (cond) if (cond) stmt else stmt
```
Should the `else` belong to the first `if` or the second?
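The two possible resolutions correspond to two different parse trees. As a self-contained sketch (the `Stmt` ADT below is purely illustrative, not part of Alpaca): shifting attaches the `else` to the inner `if`, while reducing first attaches it to the outer one.

```scala
// Hypothetical AST types, for illustration only
sealed trait Stmt
case class Exec(name: String) extends Stmt
case class If(cond: String, thenBranch: Stmt, elseBranch: Option[Stmt]) extends Stmt

// Input: if (c1) if (c2) s1 else s2

// Shifting on `else` binds it to the inner `if` (the conventional choice):
val shiftTree = If("c1", If("c2", Exec("s1"), Some(Exec("s2"))), None)

// Reducing first binds the `else` to the outer `if`:
val reduceTree = If("c1", If("c2", Exec("s1"), None), Some(Exec("s2")))

assert(shiftTree != reduceTree) // the grammar alone cannot choose between them
```

Both trees are valid derivations of the same token sequence, which is exactly why the parser needs an explicit resolution.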
### Reduce/Reduce Conflict
Occurs when the parser has two or more different rules that could be applied to the same symbols on the stack. This usually indicates a more serious ambiguity in the grammar.
## 2. Resolving Conflicts in Alpaca

Conflicts are resolved by overriding the `resolutions` member in your `Parser` object. This member is a `Set` of `ConflictResolution` rules.

```scala
object MyParser extends Parser:
  // ... rules ...

  override val resolutions = Set(
    // Resolve conflicts here
  )
```
### The `before` and `after` Operators

Alpaca uses `before` and `after` to define precedence:

- **`A after B`**: `A` has **higher precedence** than `B`. `A` will be reduced **later** than `B` (or `B` will be shifted while `A` is waiting).
- **`A before B`**: `A` has **lower precedence** than `B`. `A` will be reduced **earlier** than `B`.

You can use these operators with both **Tokens** (terminals) and **Productions** (rules).
#### Operator Precedence Example
```scala
override val resolutions = Set(
  MyLexer.STAR after MyLexer.PLUS,
  MyLexer.SLASH after MyLexer.PLUS
)
```
This tells the parser that `*` and `/` have higher precedence than `+`.
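To see what this precedence buys you, here is a plain-Scala sketch (independent of Alpaca; the `Expr` ADT is illustrative) of the tree shape such a resolution produces for `1 + 2 * 3`. The higher-precedence `*` ends up lower in the AST:

```scala
// Illustrative AST, not Alpaca's internal representation
sealed trait Expr
case class Num(v: Int) extends Expr
case class Add(l: Expr, r: Expr) extends Expr
case class Mul(l: Expr, r: Expr) extends Expr

// With STAR after PLUS, "1 + 2 * 3" parses as Add(1, Mul(2, 3)):
val tree = Add(Num(1), Mul(Num(2), Num(3)))

def eval(e: Expr): Int = e match
  case Num(v)    => v
  case Add(l, r) => eval(l) + eval(r)
  case Mul(l, r) => eval(l) * eval(r)

assert(eval(tree) == 7) // not (1 + 2) * 3 == 9
```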
#### Associativity Example
To define **left-associativity** (e.g., `1 + 2 + 3` is `(1 + 2) + 3`), a production should be "before" its own recursive tokens.

```scala
val Expr: Rule[Int] = rule(
  "add" { case (Expr(l), MyLexer.PLUS(_), Expr(r)) => l + r }
)

override val resolutions = Set(
  production.add before MyLexer.PLUS
)
```
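Associativity matters for non-commutative operators. A quick plain-Scala illustration of why `(10 - 4) - 3` and `10 - (4 - 3)` differ:

```scala
val nums = List(10, 4, 3)

// Left-associative grouping: (10 - 4) - 3
val leftAssoc = nums.reduceLeft(_ - _)

// Right-associative grouping: 10 - (4 - 3)
val rightAssoc = nums.reduceRight(_ - _)

assert(leftAssoc == 3)
assert(rightAssoc == 9)
```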
## 3. Named Productions

For fine-grained control, you can assign names to specific productions within a rule. This allows you to reference that exact production in your `resolutions`.

```scala
val Expr: Rule[Int] = rule(
  "mul" { case (Expr(l), MyLexer.STAR(_), Expr(r)) => l * r },
  "div" { case (Expr(l), MyLexer.SLASH(_), Expr(r)) => l / r },
  { case MyLexer.NUM(n) => n.value }
)

override val resolutions = Set(
  production.mul after MyLexer.PLUS,
  production.div after MyLexer.PLUS
)
```
## 4. Understanding Conflict Errors

When Alpaca detects an unresolved conflict, it produces a detailed compile-time error message.

**Example Error:**
```text
Shift "+ ($plus)" vs Reduce Expr -> Expr + ($plus) Expr
In situation like:
Expr + ($plus) Expr + ($plus) ...
Consider marking production Expr -> Expr + ($plus) Expr to be alwaysBefore or alwaysAfter "+ ($plus)"
```

### How to read it:
1. **The Conflict**: It tells you exactly which actions are clashing (Shift `+` vs Reduce `Expr`).
2. **The Context**: "In situation like..." shows you a sample input sequence that triggers the ambiguity.
3. **The Suggestion**: It suggests which production and token need a resolution rule.
## 5. Debugging with Graphs

If enabled in [Debug Settings](../debug-settings.html), Alpaca can generate internal representations of the parse table and conflict resolutions to help you visualize complex grammars.

## Best Practices

1. **Keep it Minimal**: Only add resolutions for actual conflicts reported by the compiler.
2. **Use Named Productions**: They make your intentions clearer than referencing raw tokens or entire rules.
3. **Think in Terms of Trees**: Remember that "higher precedence" (`after`) means the operation will appear **lower** in the resulting AST (it stays on the stack longer).
# Guide: Contextual Parsing

Contextual parsing refers to the ability of a lexer or parser to change its behavior or maintain state based on the input it has already seen. Alpaca provides powerful mechanisms for this through `LexerCtx` and `ParserCtx`.

## 1. Lexer-Level Context

Most contextual logic in Alpaca happens at the lexer level. Since the lexer tokenizes the entire input before the parser starts, the lexer context is the primary place to track state that affects tokenization.

### Pattern: Brace Matching & Nesting

You can use a context to track nesting levels of braces, parentheses, or brackets.
```scala
import alpaca.*
import scala.collection.mutable.Stack

case class BraceCtx(
  var text: CharSequence = "",
  val stack: Stack[String] = Stack()
) extends LexerCtx

val myLexer = lexer[BraceCtx]:
  case "\\(" =>
    ctx.stack.push("paren")
    Token["("]
  case "\\)" =>
    if ctx.stack.isEmpty || ctx.stack.pop() != "paren" then
      throw RuntimeException("Mismatched parenthesis")
    Token[")"]
```

Note that the patterns use `"\\("` and `"\\)"`: in a Scala string literal, a single backslash before a parenthesis is an invalid escape, so the backslash itself must be escaped to produce a literal-parenthesis regex.
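The stack-based bookkeeping in `BraceCtx` is the classic balanced-brackets check. The same idea in plain, library-free Scala (a sketch, not Alpaca API):

```scala
import scala.collection.mutable.Stack

// Returns true when every ')' closes a matching '('
def balanced(input: String): Boolean =
  val stack = Stack[Char]()
  val ok = input.forall {
    case '(' => stack.push('('); true
    case ')' => stack.nonEmpty && stack.pop() == '('
    case _   => true
  }
  ok && stack.isEmpty

assert(balanced("(a(b)c)"))
assert(!balanced("(a))"))
assert(!balanced("(("))
```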
|
|
||||||||||||||||||||||||||
| ### Pattern: Indentation-Based Parsing | ||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||
| For languages like Python or Scala (with `braceless` syntax), you can track indentation levels in the context and emit virtual `INDENT` and `OUTDENT` tokens (or just adjust state for later use). | ||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||
```scala
case class IndentCtx(
  var text: CharSequence = "",
  var currentIndent: Int = 0,
  val indents: Stack[Int] = Stack(0)
) extends LexerCtx

val pythonicLexer = lexer[IndentCtx]:
  case x @ "\\n +" =>
    // Sketch: record the new indentation width; comparing it against
    // ctx.indents decides whether INDENT/OUTDENT tokens are needed.
    ctx.currentIndent = x.length - 1
    Token.Ignored
```

## 2. Accessing Context Fields on Lexemes

Captured context fields are exposed as dynamic members of a `Lexeme`. For example, if your `LexerCtx` has a `line` field, you can access it as `id.line`:

```scala
// id is a Lexeme; captured context fields are exposed as dynamic members
println(s"Matched ID at line ${id.line}")
```

## 3. Pattern: Mode Switching

A boolean flag in the context can switch the lexer between modes, such as inside and outside a string literal:

```scala
case "\"" =>
  ctx.inString = !ctx.inString
  Token["QUOTE"]
case "[a-z]+" if !ctx.inString => Token["KEYWORD"]
case """[^"]+""" if ctx.inString => Token["STRING_CONTENT"]
```

Note that `BetweenStages` is declared `private[alpaca]` and is not part of the public API, so it cannot be customized from user code. Prefer the supported context mechanisms, such as mixing `LineTracking` or `PositionTracking` into your context.
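The INDENT/OUTDENT bookkeeping can be prototyped without any lexer machinery. A plain-Scala sketch (assumed semantics; Alpaca's actual token-emission API may differ) that turns per-line indentation widths into virtual events, Python-style:

```scala
import scala.collection.mutable.{ListBuffer, Stack}

// Emit "INDENT"/"OUTDENT" events from leading-space widths
def indentEvents(lines: List[String]): List[String] =
  val indents = Stack(0)            // stack of active indentation widths
  val events = ListBuffer[String]()
  for line <- lines do
    val width = line.takeWhile(_ == ' ').length
    if width > indents.top then
      indents.push(width)
      events += "INDENT"
    else
      while width < indents.top do  // one OUTDENT per closed block
        indents.pop()
        events += "OUTDENT"
  events.toList

assert(indentEvents(List("a", "  b", "    c", "d")) ==
  List("INDENT", "INDENT", "OUTDENT", "OUTDENT"))
```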
# Guide: Lexer Error Handling

By default, Alpaca's lexer is strict: it throws a `RuntimeException` as soon as it encounters a character that doesn't match any of your defined token patterns.

However, for real-world applications, you often want the lexer to be more resilient, either by reporting multiple errors or by skipping invalid characters and continuing.
This guide explores strategies for implementing custom error handling in your lexer.

## 1. Default Behavior

When `tokenize` fails to find a match, it throws an exception:
```text
java.lang.RuntimeException: Unexpected character: '?'
```
## 2. Strategy: Catch-all Error Token

The most common strategy is to add a pattern at the **end** of your lexer that matches any single character.
This "catch-all" pattern will only be reached if no other token matches.

```scala
val myLexer = lexer:
  case "[0-9]+" => Token["NUM"]
  // ... other tokens ...
  case "\\s+" => Token.Ignored

  // Catch-all: matches any single character that wasn't matched above
  case "." => Token["ERROR"]
```

By emitting an `ERROR` token, the lexer can continue tokenizing the rest of the input.
Your parser can then decide how to handle these `ERROR` tokens.
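The mechanics of a catch-all fallback can be sketched with plain regexes, independent of Alpaca: try the real token classes first, and only when none matches at the current position, emit a one-character `ERROR` token and move on.

```scala
val Num = "[0-9]+".r
val Ws  = "\\s+".r

// Prefix matching with a one-character ERROR fallback
def tokenize(input: String): List[(String, String)] =
  val out = scala.collection.mutable.ListBuffer[(String, String)]()
  var rest = input
  while rest.nonEmpty do
    (Num.findPrefixOf(rest), Ws.findPrefixOf(rest)) match
      case (Some(m), _) => out += ("NUM" -> m); rest = rest.drop(m.length)
      case (_, Some(m)) => rest = rest.drop(m.length) // whitespace is ignored
      case _            => out += ("ERROR" -> rest.take(1)); rest = rest.drop(1)
  out.toList

assert(tokenize("12 ?34") == List("NUM" -> "12", "ERROR" -> "?", "NUM" -> "34"))
```

Because the fallback consumes exactly one character, the scan always makes progress and reports every bad character instead of stopping at the first one.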
## 3. Strategy: Error Counting in Context

You can use a custom `LexerCtx` to track the number of errors encountered during tokenization.

```scala
case class ErrorCtx(
  var text: CharSequence = "",
  var errorCount: Int = 0
) extends LexerCtx

val myLexer = lexer[ErrorCtx]:
  case "[a-z]+" => Token["ID"]
  case "\\s+" => Token.Ignored

  case x @ "." =>
    ctx.errorCount += 1
    println(s"Error: Unexpected character '$x' at position ${ctx.position}")
    Token.Ignored // Skip the character
```
## 4. Strategy: Error Recovery via Ignored Tokens

If you want to simply skip unexpected characters and proceed as if they weren't there, you can use `Token.Ignored` in your catch-all case.

```scala
val resilientLexer = lexer:
  case "[0-9]+" => Token["NUM"]
  case "\\s+" => Token.Ignored

  // Catch-all: silently drop any unexpected character
  case "." => Token.Ignored
```