From a5b61fd6bf121dbd32311ab2f28a0949c2f3c143 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Bart=C5=82omiej=20Kozak?=
Date: Mon, 16 Feb 2026 23:45:07 +0100
Subject: [PATCH] docs: add advanced guides for conflict resolution, contextual
 parsing and lexer error handling

---
 docs/_docs/guides/conflict-resolution.md  | 107 +++++++++++++++++
 docs/_docs/guides/contextual-parsing.md   | 134 ++++++++++++++++++++++
 docs/_docs/guides/lexer-error-handling.md |  90 +++++++++++++++
 3 files changed, 331 insertions(+)
 create mode 100644 docs/_docs/guides/conflict-resolution.md
 create mode 100644 docs/_docs/guides/contextual-parsing.md
 create mode 100644 docs/_docs/guides/lexer-error-handling.md

diff --git a/docs/_docs/guides/conflict-resolution.md b/docs/_docs/guides/conflict-resolution.md
new file mode 100644
index 00000000..46f5b69e
--- /dev/null
+++ b/docs/_docs/guides/conflict-resolution.md
@@ -0,0 +1,107 @@
+# Guide: Conflict Resolution
+
+LR parsers, like the one generated by Alpaca, can encounter **conflicts** when the grammar is ambiguous. A conflict occurs when the parser has multiple valid actions for a given state and lookahead token.
+
+Alpaca provides a declarative DSL to resolve these conflicts at compile time, ensuring your parser is deterministic and behaves as expected.
+
+## 1. Types of Conflicts
+
+### Shift/Reduce Conflict
+Occurs when the parser can either **shift** (move the current token to the stack) or **reduce** (apply a grammar rule to symbols already on the stack).
+
+**Example (Dangling Else):**
+```scala
+if (cond) if (cond) stmt else stmt
+```
+Should the `else` belong to the first `if` or the second?
+
+### Reduce/Reduce Conflict
+Occurs when the parser has two or more different rules that could be applied to the same symbols on the stack. This usually indicates a more serious ambiguity in the grammar.
+
+## 2. Resolving Conflicts in Alpaca
+
+Conflicts are resolved by overriding the `resolutions` member in your `Parser` object.
+This member is a `Set` of `ConflictResolution` rules.
+
+```scala
+object MyParser extends Parser:
+  // ... rules ...
+
+  override val resolutions = Set(
+    // Resolve conflicts here
+  )
+```
+
+### The `before` and `after` Operators
+
+Alpaca uses `before` and `after` to define precedence:
+
+- **`A after B`**: `A` has **higher precedence** than `B`. `A` will be reduced **later** than `B` (or `B` will be shifted while `A` is waiting).
+- **`A before B`**: `A` has **lower precedence** than `B`. `A` will be reduced **earlier** than `B`.
+
+You can use these operators with both **Tokens** (Terminals) and **Productions** (Rules).
+
+#### Operator Precedence Example
+```scala
+override val resolutions = Set(
+  MyLexer.STAR after MyLexer.PLUS,
+  MyLexer.SLASH after MyLexer.PLUS
+)
+```
+This tells the parser that `*` and `/` have higher precedence than `+`.
+
+#### Associativity Example
+To define **left-associativity** (e.g., `1 + 2 + 3` is `(1 + 2) + 3`), a production should be "before" its own operator token (here, `PLUS`).
+
+```scala
+val Expr: Rule[Int] = rule(
+  "add" { case (Expr(l), MyLexer.PLUS(_), Expr(r)) => l + r }
+)
+
+override val resolutions = Set(
+  production.add before MyLexer.PLUS
+)
+```
+
+## 3. Named Productions
+
+For fine-grained control, you can assign names to specific productions within a rule. This allows you to reference that exact production in your `resolutions`.
+
+```scala
+val Expr: Rule[Int] = rule(
+  "mul" { case (Expr(l), MyLexer.STAR(_), Expr(r)) => l * r },
+  "div" { case (Expr(l), MyLexer.SLASH(_), Expr(r)) => l / r },
+  { case MyLexer.NUM(n) => n.value }
+)
+
+override val resolutions = Set(
+  production.mul after MyLexer.PLUS,
+  production.div after MyLexer.PLUS
+)
+```
+
+## 4. Understanding Conflict Errors
+
+When Alpaca detects an unresolved conflict, it produces a detailed compile-time error message.
+
+**Example Error:**
+```text
+Shift "+ ($plus)" vs Reduce Expr -> Expr + ($plus) Expr
+In situation like:
+Expr + ($plus) Expr + ($plus) ...
+Consider marking production Expr -> Expr + ($plus) Expr to be alwaysBefore or alwaysAfter "+ ($plus)"
+```
+
+### How to read it:
+1. **The Conflict**: It tells you exactly which actions are clashing (Shift `+` vs Reduce `Expr`).
+2. **The Context**: "In situation like..." shows you a sample input sequence that triggers the ambiguity.
+3. **The Suggestion**: It suggests which production and token need a resolution rule.
+
+## 5. Debugging with Graphs
+
+If enabled in [Debug Settings](../debug-settings.html), Alpaca can generate internal representations of the parse table and conflict resolutions to help you visualize complex grammars.
+
+## Best Practices
+
+1. **Keep it Minimal**: Only add resolutions for actual conflicts reported by the compiler.
+2. **Use Named Productions**: They make your intentions clearer than referencing raw tokens or entire rules.
+3. **Think in Terms of Trees**: Remember that "higher precedence" (`after`) means the operation will appear **lower** in the resulting AST (it stays on the stack longer).
diff --git a/docs/_docs/guides/contextual-parsing.md b/docs/_docs/guides/contextual-parsing.md
new file mode 100644
index 00000000..467291b8
--- /dev/null
+++ b/docs/_docs/guides/contextual-parsing.md
@@ -0,0 +1,134 @@
+# Guide: Contextual Parsing
+
+Contextual parsing refers to the ability of a lexer or parser to change its behavior or maintain state based on the input it has already seen. Alpaca provides powerful mechanisms for this through `LexerCtx`, `ParserCtx`, and the `BetweenStages` hook.
+
+## 1. Lexer-Level Context
+
+Most contextual logic in Alpaca happens at the lexer level. Since the lexer tokenizes the entire input before the parser starts, the lexer context is the primary place to track state that affects tokenization.
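+
+As a minimal sketch of the idea (assuming only the `lexer` builder and `LexerCtx` base shown in this guide; the `CountingCtx` name and its `tokenCount` field are illustrative, not part of the API), a context can do nothing more than count the meaningful tokens seen so far:
+
+```scala
+import alpaca.*
+
+// Illustrative context: `tokenCount` is our own field.
+case class CountingCtx(
+  var text: CharSequence = "",
+  var tokenCount: Int = 0
+) extends LexerCtx
+
+val countingLexer = lexer[CountingCtx]:
+  case "[a-z]+" =>
+    ctx.tokenCount += 1 // mutate the shared context on every match
+    Token["WORD"]
+  case "\\s+" => Token.Ignored
+```
+
+The patterns below use exactly the same mechanism, with richer state in place of the counter.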
+
+### Pattern: Brace Matching & Nesting
+
+You can use a context to track nesting levels of braces, parentheses, or brackets.
+
+```scala
+import alpaca.*
+import scala.collection.mutable.Stack
+
+case class BraceCtx(
+  var text: CharSequence = "",
+  val stack: Stack[String] = Stack()
+) extends LexerCtx
+
+val myLexer = lexer[BraceCtx]:
+  case "\\(" =>
+    ctx.stack.push("paren")
+    Token["("]
+  case "\\)" =>
+    if ctx.stack.isEmpty || ctx.stack.pop() != "paren" then
+      throw RuntimeException("Mismatched parenthesis")
+    Token[")"]
+```
+
+### Pattern: Indentation-Based Parsing
+
+For languages like Python or Scala (with braceless syntax), you can track indentation levels in the context and emit virtual `INDENT` and `OUTDENT` tokens (or just adjust state for later use).
+
+```scala
+case class IndentCtx(
+  var text: CharSequence = "",
+  var currentIndent: Int = 0,
+  val indents: Stack[Int] = Stack(0)
+) extends LexerCtx
+
+val pythonicLexer = lexer[IndentCtx]:
+  // A newline followed by the next line's leading spaces
+  case x @ "\\n *" =>
+    val newIndent = x.length - 1
+    if newIndent > ctx.currentIndent then
+      ctx.indents.push(newIndent)
+      ctx.currentIndent = newIndent
+      Token["INDENT"]
+    else if newIndent < ctx.currentIndent then
+      // ... logic to emit multiple OUTDENTs ...
+      Token["OUTDENT"]
+    else
+      Token.Ignored
+```
+
+## 2. Accessing Lexer Context in the Parser
+
+Every `Lexeme` matched by the lexer carries a "snapshot" of the `LexerCtx` as it was at the moment that specific token was matched.
+
+This is extremely useful for error reporting or for logic that depends on when a token appeared.
+
+```scala
+object MyParser extends Parser:
+  val root = rule:
+    case MyLexer.ID(id) =>
+      // id is a Lexeme, which has a .fields property
+      // fields contains all members of your LexerCtx
+      // (here we assume a context that tracks a `line` field)
+      println(s"Matched ID at line ${id.fields.line}")
+      id.value
+```
+
+## 3. Parser-Level Context (`ParserCtx`)
+
+`ParserCtx` is for maintaining state during the reduction process.
+This is where you build symbol tables, track variable declarations, or perform type checking.
+
+```scala
+case class MyParserCtx(var symbols: Map[String, Type] = Map()) extends ParserCtx
+
+object MyParser extends Parser[MyParserCtx]:
+  val root = rule:
+    case Decl(d) => d
+
+  val Decl = rule:
+    case (MyLexer.VAR(_), MyLexer.ID(id), MyLexer.TYPE(t)) =>
+      ctx.symbols += (id.value -> t.value)
+      // ...
+```
+
+## 4. Mode Switching (Lexical Feedback)
+
+Sometimes you need to change how the lexer behaves based on what it just matched. For example, when parsing a string with interpolation:
+`"Hello ${user.name}!"`
+
+While Alpaca doesn't support "real-time" feedback from the parser to the lexer (as the lexer finishes first), you can implement modes within the lexer using context state.
+
+```scala
+case class ModeCtx(var inString: Boolean = false, var text: CharSequence = "") extends LexerCtx
+
+val modeLexer = lexer[ModeCtx]:
+  case "\"" =>
+    ctx.inString = !ctx.inString
+    Token["QUOTE"]
+
+  case "[a-z]+" if !ctx.inString => Token["KEYWORD"]
+  case "[^\"]+" if ctx.inString => Token["STRING_CONTENT"]
+```
+
+## 5. The `BetweenStages` Hook
+
+The `BetweenStages` hook is the internal engine that powers context updates. It is a function called by Alpaca after **every** token match (including `Token.Ignored`) but **before** the next match starts.
+
+### Automatic Updates
+By default, Alpaca uses `BetweenStages` to automatically update the `text` field in your context. If your context extends `LineTracking` or `PositionTracking`, it also increments `line` and `position` counters.
+
+### Customizing `BetweenStages`
+If you need complex logic to run after every match, you can provide a custom `given` instance of `BetweenStages`.
+
+```scala
+given MyBetweenStages: BetweenStages[MyCtx] with
+  def apply(token: Token[?, MyCtx, ?], matcher: Matcher, ctx: MyCtx): Unit =
+    // Custom global logic
+    println(s"Just matched ${token.info.name}")
+```
+
+## Summary of Data Flow
+
+1. 
+**Input String** flows into the `lexer`.
+2. **`BetweenStages`** updates the `LexerCtx` after every match.
+3. **`Lexeme`s** are produced, each capturing the current `LexerCtx` state.
+4. **List[Lexeme]** flows into the `parser`.
+5. **`ParserCtx`** is initialized and updated as rules are reduced.
+6. **Result** is produced, along with the final `ParserCtx`.
diff --git a/docs/_docs/guides/lexer-error-handling.md b/docs/_docs/guides/lexer-error-handling.md
new file mode 100644
index 00000000..dcb1a900
--- /dev/null
+++ b/docs/_docs/guides/lexer-error-handling.md
@@ -0,0 +1,90 @@
+# Guide: Lexer Error Handling
+
+By default, Alpaca's lexer is strict: it throws a `RuntimeException` as soon as it encounters a character that doesn't match any of your defined token patterns.
+
+However, for real-world applications, you often want the lexer to be more resilient, either by reporting multiple errors or by skipping invalid characters and continuing.
+This guide explores strategies for implementing custom error handling in your lexer.
+
+## 1. Default Behavior
+
+When `tokenize` fails to find a match, it throws an exception:
+```text
+java.lang.RuntimeException: Unexpected character: '?'
+```
+
+## 2. Strategy: Catch-all Error Token
+
+The most common strategy is to add a pattern at the **end** of your lexer that matches any single character.
+This "catch-all" pattern will only be reached if no other token matches.
+
+```scala
+val myLexer = lexer:
+  case "[0-9]+" => Token["NUM"]
+  // ... other tokens ...
+  case "\\s+" => Token.Ignored
+
+  // Catch-all: matches any single character that wasn't matched above
+  case "." => Token["ERROR"]
+```
+
+By emitting an `ERROR` token, the lexer can continue tokenizing the rest of the input.
+Your parser can then decide how to handle these `ERROR` tokens.
+
+## 3. Strategy: Error Counting in Context
+
+You can use a custom `LexerCtx` to track the number of errors encountered during tokenization.
+
+```scala
+case class ErrorCtx(
+  var text: CharSequence = "",
+  var errorCount: Int = 0
+) extends LexerCtx
+
+val myLexer = lexer[ErrorCtx]:
+  case "[a-z]+" => Token["ID"]
+  case "\\s+" => Token.Ignored
+
+  case x @ "." =>
+    ctx.errorCount += 1
+    // (mix in PositionTracking if you also want to report ctx.position)
+    println(s"Error: Unexpected character '$x'")
+    Token.Ignored // Skip the character
+```
+
+## 4. Strategy: Error Recovery via Ignored Tokens
+
+If you want to simply skip unexpected characters and proceed as if they weren't there, you can use `Token.Ignored` in your catch-all case.
+
+```scala
+val resilientLexer = lexer:
+  case "[0-9]+" => Token["NUM"]
+  case "\\s+" => Token.Ignored
+
+  // Log and ignore (reportError is your own logging helper)
+  case x @ "." =>
+    reportError(x)
+    Token.Ignored
+```
+
+## 5. Implementation Considerations
+
+### Precedence
+Always place your catch-all pattern at the **very bottom** of your `lexer` block. Since Alpaca tries to match patterns in order (or uses the longest match with precedence), putting a `.` at the top would match everything and shadow your other rules.
+
+### Performance
+A catch-all `.` pattern can slightly impact performance if your input contains many invalid sequences, as it will be matched character by character. However, for most use cases, the overhead is negligible.
+
+### Parser Integration
+If your lexer produces `ERROR` tokens, your parser rules should be prepared to handle them, or you should filter them out before passing the lexeme list to `parser.parse()`.
+
+```scala
+val (finalCtx, lexemes) = myLexer.tokenize(input)
+val validLexemes = lexemes.filter(_.name != "ERROR")
+val result = myParser.parse(validLexemes)
+```
+
+## Summary
+
+- **Default**: Fails fast with an exception.
+- **Catch-all**: Use `case "."` to capture invalid characters.
+- **Resilience**: Use `LexerCtx` to track and report errors without stopping.
+- **Emitters**: Choose between emitting an `ERROR` token or using `Token.Ignored` to skip.
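+
+These strategies combine naturally. As a sketch under the same assumptions as the examples above (`ResilientCtx` is an illustrative name, and `tokenize` returning the final context is taken from the Parser Integration snippet), a lexer can count errors in its context while also surfacing them as `ERROR` tokens, so the caller decides whether to abort or to filter and continue:
+
+```scala
+case class ResilientCtx(
+  var text: CharSequence = "",
+  var errorCount: Int = 0
+) extends LexerCtx
+
+val combinedLexer = lexer[ResilientCtx]:
+  case "[0-9]+" => Token["NUM"]
+  case "\\s+" => Token.Ignored
+  case "." =>
+    ctx.errorCount += 1 // remember that something went wrong...
+    Token["ERROR"]      // ...but keep the bad character visible downstream
+
+// The caller chooses the policy: abort, or drop ERROR lexemes and parse on.
+val (finalCtx, lexemes) = combinedLexer.tokenize(input)
+if finalCtx.errorCount > 0 then
+  println(s"${finalCtx.errorCount} lexical error(s) found")
+```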