Documentation: extractors tutorial and API docs#299
halotukozak wants to merge 5 commits into master from
Conversation
Pull request overview
This PR expands Alpaca’s documentation set by adding new tutorials/guides and standalone API docs for the Lexer and Parser, aiming to help users understand extractors, contextual parsing, conflict resolution, and core APIs.
Changes:
- Added a tutorial explaining token/rule extractors and EBNF helpers (`.List`, `.Option`).
- Added new guides for contextual parsing (including `BetweenStages`) and conflict resolution.
- Added dedicated API documentation pages for the Lexer and Parser.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 10 comments.
Show a summary per file
| File | Description |
|---|---|
| docs/_docs/tutorials/extractors.md | New tutorial describing extractor patterns for tokens/rules and EBNF helpers. |
| docs/_docs/lexer.md | New Lexer API documentation with examples for defining lexers, contexts, and tokenization. |
| docs/_docs/parser.md | New Parser API documentation including grammar rules, EBNF operators, conflict resolution, and parsing flow. |
| docs/_docs/guides/contextual-parsing.md | New guide describing lexer/parser context usage and the BetweenStages hook. |
| docs/_docs/guides/conflict-resolution.md | New guide explaining LR conflicts and Alpaca’s before/after resolution DSL. |
| case MyLexer.NUM(n) => n.value // n is a Lexem object | ||
| ``` | ||
| The `Lexem` object contains: |
In this example, n is a Lexeme, but the text calls it a Lexem. This typo is repeated in the following sentence and could confuse readers searching for the type in the API.
| ```scala | ||
| val Decl: Rule[Val] = rule: | ||
| case (MyLexer.VAL(_), MyLexer.ID(id), MyLexer.Type.Option(t) => ... |
This `.Option` example snippet is syntactically invalid (the tuple pattern is missing its closing `)` before `=>`) and likely references the wrong symbol (`MyLexer.Type` doesn’t match the token/rule naming used elsewhere). Please fix the snippet so it compiles and demonstrates `.Option` on an actual symbol (e.g., a Rule’s `.Option`).
| case (MyLexer.VAL(_), MyLexer.ID(id), MyLexer.Type.Option(t) => ... | |
| case (MyLexer.VAL(_), MyLexer.ID(id), Expr.Option(optExpr)) => ... |
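For reference, the `.Option` style can be modeled outside Alpaca with a plain Scala extractor whose `unapply` always succeeds. The `OptionOf` name below is illustrative, not part of the library; it only sketches the shape an `.Option` helper on a symbol can desugar to:

```scala
// Illustrative sketch, not Alpaca's API: an irrefutable extractor that
// always matches and wraps its result in Some/None.
object OptionOf:
  def unapply[A](xs: List[A]): Some[Option[A]] = Some(xs.headOption)

val described = List(1, 2) match
  case OptionOf(Some(h)) => s"head = $h"
  case OptionOf(None)    => "empty"

println(described) // head = 1
```

Because the extractor returns `Some`, the match always succeeds and the caller only decides between the `Some`/`None` payloads, which is exactly what an optional grammar element needs.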
| Most contextual logic in Alpaca happens at the lexer level. | ||
| Since the lexer tokenizes the entire input before the parser starts, the lexer context is the primary place to track state that affects tokenization. | ||
| ### Exaple: Brace Matching & Nesting |
Heading typo: “Exaple” → “Example”.
| ### Exaple: Brace Matching & Nesting | |
| ### Example: Brace Matching & Nesting |
| Token["("] | ||
| case "\\)" => | ||
| if ctx.stack.isEmpty || ctx.stack.pop() != "paren" then | ||
| throw RuntimeException("Mismatched parenthesis") |
`throw RuntimeException("Mismatched parenthesis")` relies on Scala 3 creator applications, so it does compile under Scala 3; still, `throw new RuntimeException(...)` is the more conventional spelling, and a more specific exception type would serve readers better.
| throw RuntimeException("Mismatched parenthesis") | |
| throw new RuntimeException("Mismatched parenthesis") |
| The `lexer` block uses `LexerCtx.Default` by default, which tracks `line` and `position`. | ||
| ```scala | ||
| val myLexer = lexer: | ||
| case "\n" => | ||
| ctx.line += 1 | ||
| ctx.position = 1 | ||
| Token.Ignored | ||
| case "." => | ||
| ctx.position += 1 |
The “Default Context” example manually increments ctx.line / ctx.position, but LexerCtx.Default already mixes in LineTracking and PositionTracking, whose BetweenStages hooks update these counters automatically after each match. As written, the example will double-increment on newlines / characters; either remove the manual updates or show this with a custom context that doesn’t include those tracking traits.
| The `lexer` block uses `LexerCtx.Default` by default, which tracks `line` and `position`. | |
| ```scala | |
| val myLexer = lexer: | |
| case "\n" => | |
| ctx.line += 1 | |
| ctx.position = 1 | |
| Token.Ignored | |
| case "." => | |
| ctx.position += 1 | |
| The `lexer` block uses `LexerCtx.Default` by default, which tracks `line` and `position` automatically as input is consumed. | |
| ```scala | |
| val myLexer = lexer: | |
| case "\n" => | |
| // Newlines are ignored; line and position are updated automatically | |
| Token.Ignored | |
| case "." => |
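To see why the manual updates are a problem, here is a standalone Scala sketch with no Alpaca types: `Ctx`, `betweenStages`, and `onNewline` are stand-ins for the context, the framework hook, and a user rule action. When the hook already advances the counters, a manual update in the action counts the same newline twice:

```scala
// Stand-in for a lexer context with tracking fields.
class Ctx:
  var line = 1
  var position = 1

// Stand-in for the automatic hook that runs after every match.
def betweenStages(matched: String, ctx: Ctx): Unit =
  if matched == "\n" then { ctx.line += 1; ctx.position = 1 }
  else ctx.position += matched.length

// A user action that ALSO increments manually.
def onNewline(ctx: Ctx): Unit = ctx.line += 1

val ctx = Ctx()
onNewline(ctx)            // manual update in the rule action
betweenStages("\n", ctx)  // automatic update by the hook
println(ctx.line)         // 3, not 2: the newline was counted twice
```

Dropping the manual increment (or using a context without the tracking behavior) removes the double count.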
| - **Symbols**: The building blocks of your grammar. They can be: | ||
| - **Terminals**: Tokens from your lexer (e.g., `MyLexer.PLUS`). | ||
| - **Non-Terminals**: Other rules (e.g., `Expr[Double]`). |
In the “Symbols” section, Expr[Double] is not valid Scala syntax for referring to a rule; it reads like a type application. Consider changing this example to something like val Expr: Rule[Double] = ... (or just Expr) to avoid confusing readers.
| - **Non-Terminals**: Other rules (e.g., `Expr[Double]`). | |
| - **Non-Terminals**: Other rules (e.g., `Expr`). |
| - `value`: The extracted value (e.g., `Double`, `Int`, `String`). | ||
| - `name`: The name of the token. | ||
| - `fields`: A NamedTuple containing context information (like `line` and `position`). |
The docs say fields is a NamedTuple, but Lexeme.fields is implemented as a Map[String, Any] (see src/alpaca/internal/lexer/Lexeme.scala). Please update this description to match the actual API (and consider noting that field access is via Selectable, e.g. n.line).
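The `Selectable`-based access mentioned above can be sketched in plain Scala 3. The `Record` class below is illustrative, not the actual `Lexeme.fields` implementation; it only shows how a `Map`-backed structure still supports dot-style access like `n.line`:

```scala
// Illustrative sketch: a Map-backed structure with dot-style access
// (fields.line, fields.position) via scala.Selectable.
class Record(elems: (String, Any)*) extends Selectable:
  private val underlying = elems.toMap
  def selectDynamic(name: String): Any = underlying(name)

// The refinement type tells the compiler which fields exist and their types.
type Fields = Record { val line: Int; val position: Int }

val fields = Record("line" -> 3, "position" -> 14).asInstanceOf[Fields]
println(fields.line)     // 3
println(fields.position) // 14
```

The compiler rewrites `fields.line` into a `selectDynamic("line")` call and casts the result to the type declared in the refinement, so the runtime representation can remain a plain `Map[String, Any]`.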
| case class BraceCtx( | ||
| var text: CharSequence = "", | ||
| stack: mutable.Stack[String] = mutable.Stack() |
This snippet won’t compile as written: it imports scala.collection.mutable.Stack but then uses mutable.Stack without importing scala.collection.mutable (or aliasing it). Either use Stack[String] / Stack() in the case class, or import scala.collection.mutable and refer to mutable.Stack.
| stack: mutable.Stack[String] = mutable.Stack() | |
| stack: Stack[String] = Stack() |
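Either import style works on its own; the problem is mixing them. The sketch below (plain Scala, reusing the `BraceCtx` name from the snippet) shows the short-name style, with the qualified alternative noted in a comment:

```scala
// Style 1: import the class itself and use the short name.
import scala.collection.mutable.Stack

case class BraceCtx(
  var text: CharSequence = "",
  stack: Stack[String] = Stack()
)

// Style 2 (alternative): import the package and qualify every use:
//   import scala.collection.mutable
//   stack: mutable.Stack[String] = mutable.Stack()

val ctx = BraceCtx()
ctx.stack.push("paren")
println(ctx.stack.pop()) // paren
```

Whichever style the docs pick, the import line and the usages in the case class must agree.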
| If you need complex logic to run after every match regardless of which token was matched, you can provide a custom `given` instance of `BetweenStages`. | ||
| ```scala | ||
| trait CustomTrait extends LexerCtx: | ||
| var indentLevel: Int | ||
| case class CustomCtx(var text: CharSequence = "", var indentLevel: Int = 0) extends CustomTrait derives BetweenStages | ||
| given BetweenStages[CustomTrait] = new: | ||
| def apply(token: Token[?, MyCtx, ?], matcher: Matcher, ctx: MyCtx): Unit = | ||
| // Custom logic to update indentLevel based on the matched token | ||
| token match | ||
| case Token["INDENT"](_) => ctx.indentLevel += 1 | ||
| case Token["DEDENT"](_) => ctx.indentLevel -= 1 | ||
| case _ => () | ||
| ``` |
The “Customizing BetweenStages” example is not currently usable from normal user code: BetweenStages is declared private[alpaca] (src/alpaca/internal/lexer/BetweenStages.scala), and derives BetweenStages won’t work (there’s no BetweenStages.derived). Additionally, the snippet references an undefined MyCtx and matches on Token["INDENT"], but Token[...] is compile-time-only DSL syntax, not a runtime pattern for BetweenStages. Please either rework this section to reflect the actual supported customization mechanism, or make BetweenStages a public, user-extensible API.
| If you need complex logic to run after every match regardless of which token was matched, you can provide a custom `given` instance of `BetweenStages`. | |
| ```scala | |
| trait CustomTrait extends LexerCtx: | |
| var indentLevel: Int | |
| case class CustomCtx(var text: CharSequence = "", var indentLevel: Int = 0) extends CustomTrait derives BetweenStages | |
| given BetweenStages[CustomTrait] = new: | |
| def apply(token: Token[?, MyCtx, ?], matcher: Matcher, ctx: MyCtx): Unit = | |
| // Custom logic to update indentLevel based on the matched token | |
| token match | |
| case Token["INDENT"](_) => ctx.indentLevel += 1 | |
| case Token["DEDENT"](_) => ctx.indentLevel -= 1 | |
| case _ => () | |
| ``` | |
| At the moment, `BetweenStages` is an internal implementation detail of Alpaca and is not intended to be customized directly from user code. | |
| The library provides a default `BetweenStages` implementation that: | |
| - advances the `text` field in your `LexerCtx` after each successful match, and | |
| - updates any tracking fields (for example, line/column or position counters) when your context mixes in the appropriate traits. | |
| If you need additional logic to run in response to specific tokens, you should implement it using the public API, for example by: | |
| - mutating your `LexerCtx` inside lexer rule actions, or | |
| - post-processing the produced `Lexeme`s (and their captured context) after lexing. | |
| Future versions of Alpaca may expose `BetweenStages` as a public, user-extensible hook. When that happens, this guide will be updated with a concrete customization example that reflects the supported API. |
| ### Automatic Updates | ||
| By default, Alpaca uses `BetweenStages` to automatically update the `text` field in your context (to advance past the matched string). | ||
| If your context extends `LineTracking` or `PositionTracking`, the defined hooks also increments `line` and `position` counters. |
Grammar: “hooks also increments” → “hooks also increment”.
| If your context extends `LineTracking` or `PositionTracking`, the defined hooks also increments `line` and `position` counters. | |
| If your context extends `LineTracking` or `PositionTracking`, the defined hooks also increment `line` and `position` counters. |