Conversation
- Create docs/_docs/theory/tokens.md (TH-02)
  - Terminal symbols definition, token class vs instance distinction
  - Formal lexeme definition as triple (T, w, pos) for CROSS-02
  - CalcLexer 7-token class table with patterns and value types
  - Canonical CalcLexer definition with sc:nocompile
  - Tokenization output code example with sc:nocompile
  - Cross-links to lexer.md and lexer-fa.md
- New docs/_docs/theory/ directory with pipeline.md as opening theory page
  - Explains four compilation stages: source text, lexical analysis, syntactic analysis, semantic analysis
  - Documents Alpaca's compile-time vs runtime boundary with standard callout block
  - Formal definition block using Unicode math: parse ∘ tokenize : String → R
  - Pipeline code example with sc:nocompile referencing CalcLexer and CalcParser
  - Type mapping table: pipeline stages to Alpaca types (List[Lexeme], R | Null)
  - Cross-links to lexer.md and parser.md
  - No CalcParser grammar notation, no LaTeX, all macro blocks marked sc:nocompile
- Create docs/_docs/theory/lexer-fa.md (TH-03)
  - Regular language formal definition block for CROSS-02
  - NFA/DFA conceptual explanation with state transition table for PLUS token
  - DFA 5-tuple formal definition (Q, Σ, δ, q₀, F) for CROSS-02
  - Combined alternation pattern explanation grounded in Alpaca internals
  - Shadowing detection via dregex subset checking
  - Standard compile-time callout block
  - Cross-links to lexer.md and tokens.md
- …finition
  - Top-down vs bottom-up parsing approaches
  - Left recursion infinite-loop trace showing LL failure
  - LR family comparison table: LR(0), SLR(1), LALR(1), LR(1) with Alpaca marked as LR(1)
  - Why LR(1) vs LALR(1) section grounded in Item.scala/ParseTable.scala source
  - LR(1) item formal definition using [A → α • β, a] dot notation with examples
  - O(n) parsing paragraph
  - Compile-time callout in established blockquote format
  - Cross-links to cfg.md, shift-reduce.md, ../conflict-resolution.md, ../parser.md
- …rmal configuration
  - Parse stack explanation: (stateIndex, node) pairs from Parser.scala
  - Parse tables section: parse table + action table with separation of concerns
  - Simplified 3-production grammar block for trace clarity
  - 8-row parse trace table for '1 + 2' with Stack | Remaining input | Action columns
  - Annotation notes for steps 1, 2, 6, 7, 8
  - Disclaimer that state numbers are illustrative for simplified grammar
  - 3 LR(1) item examples with dot notation from Item.scala
  - LR parse configuration formal definition in blockquote format
  - Connection to Alpaca runtime loop() function prose reference
  - O(n) loop termination paragraph
  - Compile-time callout in established blockquote format
  - Cross-links to why-lr.md, cfg.md, ../conflict-resolution.md, ../parser.md, pipeline.md
- Formal CFG 4-tuple definition (V, Σ, R, S) in blockquote format
  - 7-production CalcParser BNF grammar (6 Expr productions + root)
  - Leftmost derivation for 1 + 2 with ⇒ steps
  - ASCII parse tree for 1 + 2
  - CalcParser Alpaca DSL block annotated with sc:nocompile
  - Compile-time callout in established blockquote format
  - Cross-links to tokens.md, why-lr.md, ../parser.md, ../conflict-resolution.md
Minimum allowed coverage is
Generated by 🐒 cobertura-action against 6ac2409
🏃 Runtime Benchmark
Codecov Report

✅ All modified and coverable lines are covered by tests.

```
@@           Coverage Diff            @@
##             master     #288   +/-   ##
=========================================
  Coverage          ?   37.37%
=========================================
  Files             ?       34
  Lines             ?      404
  Branches          ?        0
=========================================
  Hits              ?      151
  Misses            ?      253
  Partials          ?        0
```
Pull request overview
Adds a set of “Compiler Theory Tutorial” documentation pages explaining Alpaca’s compilation pipeline, lexing concepts, CFGs, and LR parsing (including shift/reduce mechanics and LR(1) vs LALR(1)).
Changes:
- Introduces new theory docs for tokens/lexemes, the pipeline, CFG basics, and lexer → automata concepts.
- Adds LR-focused pages explaining why LR is used and a step-by-step shift/reduce trace.
- Adds cross-links between the new pages and existing reference docs.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 10 comments.
| File | Description |
|---|---|
| docs/_docs/theory/why-lr.md | Explains LL vs LR, left recursion, and why Alpaca uses full LR(1). |
| docs/_docs/theory/tokens.md | Defines tokens, token classes vs instances, and lexemes with CalcLexer examples. |
| docs/_docs/theory/shift-reduce.md | Walkthrough of the shift/reduce loop with a concrete parse trace. |
| docs/_docs/theory/pipeline.md | Describes the Alpaca compilation pipeline and compile-time vs runtime boundary. |
| docs/_docs/theory/lexer-fa.md | Explains regex → FA theory and relates it to Alpaca’s lexer implementation. |
| docs/_docs/theory/cfg.md | Introduces CFGs and maps the calculator grammar to Alpaca’s parser DSL. |
```markdown
**Lexeme** — the full record of a token instance: the token class, the matched text, and its
position in the source. Parsing `"3 + 4"` produces three lexemes:

- `NUMBER("3", pos=0)`
- `PLUS("+", pos=2)`
- `NUMBER("4", pos=4)`

The word *lexeme* is used throughout this documentation to mean this complete record.

> **Definition — Lexeme:**
> A *lexeme* is a triple (T, w, pos) where T is a token class, w ∈ L(T) is the matched string
> (a member of the language defined by T's regex), and pos is the position of the end of the
> match in the source text.
> In Alpaca: `Lexeme[Name, Value]` where `Name` is the token class name (a string literal type)
> and `Value` is the Scala type of the extracted value.

## Alpaca's Lexeme Type

In Alpaca, each matched token is represented as a `Lexeme[Name, Value]`. A lexeme carries four
pieces of information:

- `name` — the token class name string, e.g., `"NUMBER"` or `"PLUS"`
- `value` — the extracted value with its Scala type, e.g., `3.14: Double` for NUMBER, `(): Unit`
  for PLUS
- `position` — the character offset at the end of the match
- `line` — the line number at the end of the match
```
This section describes pos/position as an absolute character offset in the whole source and implies every lexeme always has position/line. In the implementation, LexerCtx.Default.position is a 1-based column within the current line (reset on newline), and lexeme fields depend on the chosen lexer context (e.g., LexerCtx.Empty has no line/position). Please adjust the definition and examples to match the actual semantics (line + column-at-end-of-match, when the context tracks them).
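The corrected semantics (token class, matched text, line, and column at the end of the match within that line) can be sketched with a small conceptual model. This is an illustrative Python sketch, not Alpaca's actual Scala `Lexeme` API; the field names and the hand-written positions are assumptions for illustration only.

```python
from dataclasses import dataclass

# Hypothetical sketch of a lexeme record (not Alpaca's Scala API): the token
# class name, the matched string w ∈ L(T), and line + 1-based column tracked
# at the end of the match, as a context like LexerCtx.Default would maintain.
@dataclass
class Lexeme:
    name: str     # token class name, e.g. "NUMBER"
    text: str     # matched string
    line: int     # line number at the end of the match
    column: int   # 1-based column within the current line, at end of match

# Lexing "3 + 4" by hand under these assumed semantics:
lexemes = [
    Lexeme("NUMBER", "3", line=1, column=1),
    Lexeme("PLUS",   "+", line=1, column=3),
    Lexeme("NUMBER", "4", line=1, column=5),
]
print([(l.name, l.text, l.line, l.column) for l in lexemes])
```

Note that a context without position tracking (the comment mentions `LexerCtx.Empty`) would simply omit the `line`/`column` fields.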
```markdown
| Syntactic analysis | `List[Lexeme]` | parse tree (internal) | LR(1) stack (internal) |
| Semantic analysis | parse tree | typed result | `R \| Null` (your root type) |

The parse tree is never exposed directly — Alpaca builds it internally and immediately evaluates your semantic actions (the `=>` expressions in `rule` definitions). What you get back from `parse` is the final typed value, not an intermediate tree.
```
The stage mapping and following paragraph continue to describe the syntactic stage as producing a "parse tree (internal)". This conflicts with the actual parser runtime, which only maintains an LR stack and computes results on reductions without constructing a tree object. Consider describing this as a conceptual parse structure or as "LR(1) stack + reductions" rather than an internal parse tree.
Suggested change:

```diff
-| Syntactic analysis | `List[Lexeme]` | parse tree (internal) | LR(1) stack (internal) |
-| Semantic analysis | parse tree | typed result | `R \| Null` (your root type) |
-The parse tree is never exposed directly — Alpaca builds it internally and immediately evaluates your semantic actions (the `=>` expressions in `rule` definitions). What you get back from `parse` is the final typed value, not an intermediate tree.
+| Syntactic analysis | `List[Lexeme]` | conceptual parse structure (via LR(1) stack + reductions) | LR(1) stack + reductions (internal) |
+| Semantic analysis | conceptual parse structure | typed result | `R \| Null` (your root type) |
+Alpaca never constructs or returns an explicit parse tree object. Instead, it uses an LR(1) stack and applies your semantic actions (the `=>` expressions in `rule` definitions) on each reduction, so what you get back from `parse` is the final typed value, not an intermediate tree.
```
```markdown
A lexer reads a character stream from left to right and emits a token stream. Each scan step
finds the longest prefix of the remaining input that matches one of the token class patterns —
this is the *maximal munch* rule. When no pattern matches the current position, the lexer throws
an error. The result is a flat list of lexemes that the parser consumes next.
```
The lexer implementation does not enforce a global "maximal munch" across token patterns; it matches the first alternative that succeeds in the combined ordered regex (rule-order priority), not necessarily the longest among all token classes. Please update this description to match the actual "patterns are tried in order; first match wins" behavior.
Suggested change:

```diff
-A lexer reads a character stream from left to right and emits a token stream. Each scan step
-finds the longest prefix of the remaining input that matches one of the token class patterns —
-this is the *maximal munch* rule. When no pattern matches the current position, the lexer throws
-an error. The result is a flat list of lexemes that the parser consumes next.
+A lexer reads a character stream from left to right and emits a token stream. At each position,
+it tries the token class patterns in their specified order and picks the first one whose regex
+matches a prefix of the remaining input — patterns are tried in order; first match wins. When no
+pattern matches the current position, the lexer throws an error. The result is a flat list of
+lexemes that the parser consumes next.
```
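The difference between order-priority matching and maximal munch is easy to demonstrate with a toy lexer. This is a Python sketch of the general behavior described in the suggestion, not Alpaca's implementation; the pattern names (`ASSIGN`, `EQ`) are hypothetical.

```python
import re

# Toy first-match-wins lexer: patterns are tried in declaration order at each
# position, and the first regex that matches a prefix of the remaining input
# wins, even if a later pattern would match a longer prefix.
def tokenize(patterns, text):
    tokens, pos = [], 0
    while pos < len(text):
        for name, pattern in patterns:
            m = re.compile(pattern).match(text, pos)
            if m:
                tokens.append((name, m.group()))
                pos = m.end()
                break
        else:
            raise ValueError(f"no pattern matches at position {pos}")
    return tokens

# With ASSIGN declared before EQ, "==" lexes as two ASSIGN tokens:
# order priority, not maximal munch.
print(tokenize([("ASSIGN", r"="), ("EQ", r"==")], "=="))
# [('ASSIGN', '='), ('ASSIGN', '=')]

# Reversing the declaration order restores the longest-match outcome here:
print(tokenize([("EQ", r"=="), ("ASSIGN", r"=")], "=="))
# [('EQ', '==')]
```

This is also why rule order matters in such lexers: a more specific pattern must be declared before any pattern that would shadow it.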
```markdown
- See [Lexer](../lexer.md) for the complete `lexer` DSL reference.
- See [Tokens and Lexemes](tokens.md) for what the lexer produces — the lexeme stream.
- Next: [Context-Free Grammars](theory/cfg.md) for how token streams are parsed.
```
The cross-link theory/cfg.md is incorrect from within docs/_docs/theory/lexer-fa.md (it resolves to .../theory/theory/cfg.md). It should link to cfg.md (same directory).
Suggested change:

```diff
-- Next: [Context-Free Grammars](theory/cfg.md) for how token streams are parsed.
+- Next: [Context-Free Grammars](cfg.md) for how token streams are parsed.
```
````markdown
  (1.0)     (2.0)
```

Note: In Alpaca, the parse tree is never exposed to user code. The `Parser` macro builds it internally during the shift-reduce parse, and immediately evaluates your semantic actions (the `=>` expressions in `rule` definitions) as each node is reduced. What `parse()` returns is the typed result — a `Double` in the calculator case — not an intermediate tree object. (See [The Compilation Pipeline](pipeline.md) for the full picture.)
````
This note says the Parser macro "builds" a parse tree internally. The runtime implementation performs shift/reduce and immediately evaluates semantic actions; it does not construct or retain a parse tree object (see src/alpaca/internal/parser/Parser.scala). Please rephrase to avoid contradicting the shift/reduce page (which states no parse tree is constructed).
Suggested change:

```diff
-Note: In Alpaca, the parse tree is never exposed to user code. The `Parser` macro builds it internally during the shift-reduce parse, and immediately evaluates your semantic actions (the `=>` expressions in `rule` definitions) as each node is reduced. What `parse()` returns is the typed result — a `Double` in the calculator case — not an intermediate tree object. (See [The Compilation Pipeline](pipeline.md) for the full picture.)
+Note: In Alpaca, parse trees are a conceptual model only; the runtime LR parser does not construct or retain an explicit parse-tree object. During shift–reduce parsing it immediately evaluates your semantic actions (the `=>` expressions in `rule` definitions) as each production is reduced, and the `Parser` macro’s job is to analyze those rules at compile time and generate the LR parse tables. What `parse()` returns is the typed result — a `Double` in the calculator case — not an intermediate tree structure. (See [The Compilation Pipeline](pipeline.md) for the full picture.)
```
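The "semantic actions on reduction, no tree retained" idea can be sketched concretely. This is a minimal hand-rolled Python illustration for the grammar `E -> E + NUMBER | NUMBER` over the input `1 + 2`, not Alpaca's table-driven LR(1) runtime; the token tuples and reduction logic are simplified assumptions.

```python
# Minimal sketch: a shift-reduce pass whose stack holds computed values, never
# tree nodes. Each reduction immediately runs its semantic action, so only the
# final typed result survives (illustrative Python, not Alpaca's runtime).
def parse(tokens):
    stack = []
    for name, text in tokens:
        if name == "NUMBER":
            stack.append(float(text))          # shift, then reduce E -> NUMBER
            # reduce E -> E + NUMBER as soon as both operands are on the stack
            if len(stack) >= 3 and stack[-2] == "+":
                right = stack.pop()
                stack.pop()                    # discard the "+" marker
                left = stack.pop()
                stack.append(left + right)     # semantic action runs here
        elif name == "PLUS":
            stack.append("+")                  # shift
    (result,) = stack                          # one value left: the final result
    return result

print(parse([("NUMBER", "1"), ("PLUS", "+"), ("NUMBER", "2")]))  # 3.0
```

At no point does any node object for `Expr` exist; the stack only ever contains operand values and the pending operator marker.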
```markdown
that appear in source text. In a lexer, each token class acts as a terminal: it names a category
of strings, and no lexer-level expansion applies below it.

See [Context-Free Grammars](theory/cfg.md) for how terminals fit into production rules.
```
The relative link theory/cfg.md is incorrect from within docs/_docs/theory/tokens.md (it resolves to .../theory/theory/cfg.md). It should link to cfg.md (same directory) to avoid a broken cross-link.
Suggested change:

```diff
-See [Context-Free Grammars](theory/cfg.md) for how terminals fit into production rules.
+See [Context-Free Grammars](cfg.md) for how terminals fit into production rules.
```
````markdown
// result: Double | Null = 11.0
```

`CalcLexer.tokenize` handles stages 1–2: it takes the source string and produces a `List[Lexeme]`. `CalcParser.parse` handles stages 3–4: it takes those lexemes, builds the parse tree internally, and returns the typed result.
````
CalcParser.parse does not build a parse tree internally; the runtime performs shift/reduce and applies semantic actions directly (see src/alpaca/internal/parser/Parser.scala, where reductions call actionTable(prod)(ctx, children) and no tree structure is retained). This sentence should be reworded to avoid implying a concrete internal parse tree data structure.
Suggested change:

```diff
-`CalcLexer.tokenize` handles stages 1–2: it takes the source string and produces a `List[Lexeme]`. `CalcParser.parse` handles stages 3–4: it takes those lexemes, builds the parse tree internally, and returns the typed result.
+`CalcLexer.tokenize` handles stages 1–2: it takes the source string and produces a `List[Lexeme]`. `CalcParser.parse` handles stages 3–4: it consumes those lexemes using the generated LR(1) parse table and your semantic actions to compute the typed result, without constructing an explicit parse tree data structure.
```
```markdown
- Next: [Tokens & Lexemes](theory/tokens.md) — what the lexer produces: token classes, token instances, and how they are represented in Alpaca
- [The Lexer: Regex to Finite Automata](theory/lexer-fa.md) — how regular expressions define token classes and how Alpaca compiles them
```
These cross-links use theory/... paths even though this file already lives in docs/_docs/theory/. As written, they resolve to .../theory/theory/... and break. They should link to tokens.md and lexer-fa.md (same directory).
Suggested change:

```diff
-- Next: [Tokens & Lexemes](theory/tokens.md) — what the lexer produces: token classes, token instances, and how they are represented in Alpaca
-- [The Lexer: Regex to Finite Automata](theory/lexer-fa.md) — how regular expressions define token classes and how Alpaca compiles them
+- Next: [Tokens & Lexemes](tokens.md) — what the lexer produces: token classes, token instances, and how they are represented in Alpaca
+- [The Lexer: Regex to Finite Automata](lexer-fa.md) — how regular expressions define token classes and how Alpaca compiles them
```
```markdown
This means Alpaca's lexer runs with the same O(n) guarantee as a hand-built DFA: one pass
through the input, no backtracking.
```
The claim that using Java regex implies an O(n) lexer with "no backtracking" is not accurate. Tokenization.tokenize() uses java.util.regex.Pattern/Matcher.lookingAt(), and Java's regex engine can backtrack with super-linear (even exponential) worst-cases depending on patterns; additionally, ctx.text = ctx.text.from(m.end) will create new subsequences for String inputs. Please soften this to an implementation description (combined ordered regex) without promising DFA-like complexity guarantees unless you can justify them for all supported patterns.
Suggested change:

```diff
-This means Alpaca's lexer runs with the same O(n) guarantee as a hand-built DFA: one pass
-through the input, no backtracking.
+In practice, this means Alpaca's lexer uses a single pre-compiled combined regex and scans
+through the input from left to right, matching at each position with `lookingAt()` and using
+the named capturing groups to determine the token class; the exact performance and any
+backtracking behavior are determined by the Java regex engine and the specific token patterns.
```
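The combined-regex-with-named-groups scheme described in the suggestion can be sketched in Python, which supports the same named-group alternation idiom as `java.util.regex`. This is an analogue of the approach, not Alpaca's Scala code; the token names and patterns are hypothetical calculator tokens.

```python
import re

# Sketch of a single pre-compiled combined regex: one alternation of named
# groups, one per token class, in declaration order. (Python analogue of the
# java.util.regex approach; not Alpaca's actual implementation.)
combined = re.compile(r"(?P<NUMBER>\d+(?:\.\d+)?)|(?P<PLUS>\+)|(?P<WS>\s+)")

def tokenize(text):
    tokens, pos = [], 0
    while pos < len(text):
        m = combined.match(text, pos)   # anchored at pos, like Matcher.lookingAt
        if m is None:
            raise ValueError(f"no token matches at position {pos}")
        # m.lastgroup names the alternative that matched, identifying the class
        if m.lastgroup != "WS":         # skip whitespace tokens
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens

print(tokenize("1 + 2.5"))  # [('NUMBER', '1'), ('PLUS', '+'), ('NUMBER', '2.5')]
```

Note the review's point applies here too: the regex engine, not a DFA, decides the matching cost, so worst-case behavior depends on the individual token patterns.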
| |-----------|-------------------|-------------|-------| | ||
| | LR(0) | None (reduce always) | Smallest | Too weak for most real grammars | | ||
| | SLR(1) | FOLLOW sets (global per non-terminal) | Same as LR(0) | Better, still limited | | ||
| | LALR(1) | Per-state lookahead (merged item-set cores) | Same as LR(0)/SLR | Most common in practice (yacc, Bison, ANTLR) | |
ANTLR is not an LALR(1) parser generator; it uses an LL(*) / adaptive prediction approach. Listing it as an example of LALR(1) here is misleading—please remove ANTLR from the LALR(1) examples or replace it with an actual LALR(1) tool.
Suggested change:

```diff
-| LALR(1) | Per-state lookahead (merged item-set cores) | Same as LR(0)/SLR | Most common in practice (yacc, Bison, ANTLR) |
+| LALR(1) | Per-state lookahead (merged item-set cores) | Same as LR(0)/SLR | Most common in practice (yacc, Bison) |
```
📊 Test Compilation Benchmark
Result: Current branch is 6.691s faster (11.95%) ✅
Summary
Add compiler theory foundation pages to the documentation, covering formal CS concepts that underpin Alpaca's design.
Pages added
Also includes
- `InconsistentConflictResolution`, `AlpacaTimeoutException`

🤖 Generated with Claude Code