docs: compiler theory section with gap fixes #309

halotukozak wants to merge 14 commits into master
Conversation
- Create docs/_docs/theory/tokens.md (TH-02)
  - Terminal symbols definition, token class vs instance distinction
  - Formal lexeme definition as triple (T, w, pos) for CROSS-02
  - CalcLexer 7-token class table with patterns and value types
  - Canonical CalcLexer definition with sc:nocompile
  - Tokenization output code example with sc:nocompile
  - Cross-links to lexer.md and lexer-fa.md
- New docs/_docs/theory/ directory with pipeline.md as opening theory page
  - Explains four compilation stages: source text, lexical analysis, syntactic analysis, semantic analysis
  - Documents Alpaca's compile-time vs runtime boundary with standard callout block
  - Formal definition block using Unicode math: parse ∘ tokenize : String → R
  - Pipeline code example with sc:nocompile referencing CalcLexer and CalcParser
  - Type mapping table: pipeline stages to Alpaca types (List[Lexeme], R | Null)
  - Cross-links to lexer.md and parser.md
  - No CalcParser grammar notation, no LaTeX, all macro blocks marked sc:nocompile
- Create docs/_docs/theory/lexer-fa.md (TH-03)
  - Regular language formal definition block for CROSS-02
  - NFA/DFA conceptual explanation with state transition table for PLUS token
  - DFA 5-tuple formal definition (Q, Σ, δ, q₀, F) for CROSS-02
  - Combined alternation pattern explanation grounded in Alpaca internals
  - Shadowing detection via dregex subset checking
  - Standard compile-time callout block
  - Cross-links to lexer.md and tokens.md
- …finition
  - Top-down vs bottom-up parsing approaches
  - Left recursion infinite-loop trace showing LL failure
  - LR family comparison table: LR(0), SLR(1), LALR(1), LR(1) with Alpaca marked as LR(1)
  - Why LR(1) vs LALR(1) section grounded in Item.scala/ParseTable.scala source
  - LR(1) item formal definition using [A → α • β, a] dot notation with examples
  - O(n) parsing paragraph
  - Compile-time callout in established blockquote format
  - Cross-links to cfg.md, shift-reduce.md, ../conflict-resolution.md, ../parser.md
- …rmal configuration
  - Parse stack explanation: (stateIndex, node) pairs from Parser.scala
  - Parse tables section: parse table + action table with separation of concerns
  - Simplified 3-production grammar block for trace clarity
  - 8-row parse trace table for '1 + 2' with Stack | Remaining input | Action columns
  - Annotation notes for steps 1, 2, 6, 7, 8
  - Disclaimer that state numbers are illustrative for simplified grammar
  - 3 LR(1) item examples with dot notation from Item.scala
  - LR parse configuration formal definition in blockquote format
  - Connection to Alpaca runtime loop() function prose reference
  - O(n) loop termination paragraph
  - Compile-time callout in established blockquote format
  - Cross-links to why-lr.md, cfg.md, ../conflict-resolution.md, ../parser.md, pipeline.md
- Formal CFG 4-tuple definition (V, Σ, R, S) in blockquote format
  - 7-production CalcParser BNF grammar (6 Expr productions + root)
  - Leftmost derivation for 1 + 2 with ⇒ steps
  - ASCII parse tree for 1 + 2
  - CalcParser Alpaca DSL block annotated with sc:nocompile
  - Compile-time callout in established blockquote format
  - Cross-links to tokens.md, why-lr.md, ../parser.md, ../conflict-resolution.md
- Formal definition block for Parse Table Conflict (state/symbol pair collision)
  - Shift/reduce conflict section with CalcParser 1+2+3 example and real Alpaca error message
  - alwaysBefore/alwaysAfter discrepancy note immediately after error block
  - Reduce/reduce conflict section with Integer/Float example and error message
  - LR(1) lookahead disambiguation section
  - Resolution by priority section with minimal sc:nocompile example (production.plus only)
  - Compile-time detection section with standard blockquote callout
  - Cross-links to cfg.md, shift-reduce.md, ../conflict-resolution.md, semantic-actions.md, full-example.md
- Six-step narrative from bare grammar to working calculator (7.0)
  - CalcLexer definition, bare CalcParser with ShiftReduceConflict error
  - Resolved CalcParser with all 6 resolutions using production.div (not production.divide)
  - Pipeline evaluation: 1+2*3=7.0, (1+2)*3=9.0 with null-check note
  - Semantic action trace for 1+2*3 showing 2*3 reduces before 1+...
  - Formal definition block, compile-time callout blockquote
  - Theory-to-code mapping table with cross-links to all theory pages
- Formal definition block for Semantic Action (S-attributed scheme)
  - Syntax-directed translation section with synthesized attribute explanation
  - Extractor pattern section with complete 7-production CalcParser action table
  - No-parse-tree section grounded in Parser.scala loop() implementation
  - Typed results section explaining Rule[Double] compile-time type checking
  - Compile-time processing callout
  - Cross-links to shift-reduce.md, conflicts.md, ../extractors.md, ../parser.md, full-example.md
  - No Rule[Int], no n.value.toDouble, no inherited attribute, no L-attributed
- Add 'Compiler Theory' subsection with 9 theory pages in pipeline order
  - Pages use theory/pagename.md format resolving to docs/_docs/theory/
  - Order: pipeline, tokens, lexer-fa, cfg, why-lr, shift-reduce, conflicts, semantic-actions, full-example
- Fix sibling links in theory pages
  - pipeline.md: tokens.md and lexer-fa.md sibling links no longer use theory/ prefix
  - pipeline.md: lexer.md and parser.md reference doc links now use ../ prefix
  - tokens.md: cfg.md sibling link no longer uses theory/ prefix
  - lexer-fa.md: cfg.md sibling link no longer uses theory/ prefix
- …duce section
  - Inserted identical correction blockquote after the reduce/reduce compiler output block
  - Readers who encounter only the RR error message now learn that alwaysBefore/alwaysAfter do not exist in Alpaca API
  - Correct methods are before/after per conflict-resolution.md
- …y pages
  - semantic-actions.md: replace backtick code span with functional [Parser](../parser.md) hyperlink
  - shift-reduce.md: add Next: [Conflicts and Disambiguation](conflicts.md) bullet to Cross-links
  - tokens.md: add Next: [The Lexer: Regex to Finite Automata](lexer-fa.md) bullet to Cross-links
🏃 Runtime Benchmark
Codecov Report

✅ All modified and coverable lines are covered by tests.

```
@@           Coverage Diff           @@
##           master     #309   +/-   ##
=======================================
  Coverage   42.03%   42.03%
=======================================
  Files          35       35
  Lines         433      433
=======================================
  Hits          182      182
  Misses        251      251
```
Pull request overview
Adds a new “Compiler Theory” documentation section that explains Alpaca’s compilation pipeline and LR parsing model, and wires it into the docs sidebar to improve navigation and cross-linking across the parser/lexer docs.
Changes:
- Add “Compiler Theory” nested section to docs/sidebar.yml.
- Introduce a set of new theory pages (pipeline, tokens/lexemes, regex→automata, CFGs, LR motivation, shift-reduce, conflicts, semantic actions, full worked example).
- Add/update cross-links between theory pages and existing reference docs (parser, lexer, conflict resolution, extractors).
Reviewed changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| docs/sidebar.yml | Adds the new “Compiler Theory” section and links to the new pages. |
| docs/_docs/theory/pipeline.md | Introduces the compilation pipeline framing and compile-time vs runtime boundary. |
| docs/_docs/theory/tokens.md | Defines tokens/lexemes and relates them to Alpaca’s lexer output types. |
| docs/_docs/theory/lexer-fa.md | Explains regex→automata and how Alpaca’s lexer implementation relates to that theory. |
| docs/_docs/theory/cfg.md | Introduces CFGs and maps the calculator grammar to Alpaca’s rule DSL. |
| docs/_docs/theory/why-lr.md | Motivates LR parsing vs LL and explains LR family variants. |
| docs/_docs/theory/shift-reduce.md | Walks through the shift/reduce loop with a concrete parse trace. |
| docs/_docs/theory/conflicts.md | Explains shift/reduce and reduce/reduce conflicts and Alpaca’s resolution DSL. |
| docs/_docs/theory/semantic-actions.md | Explains syntax-directed translation and Alpaca’s semantic action model. |
| docs/_docs/theory/full-example.md | Provides an end-to-end calculator example including conflict resolution and evaluation. |
> This means Alpaca's lexer runs with the same O(n) guarantee as a hand-built DFA: one pass
> through the input, no backtracking.

The claim here that the lexer has the same O(n) guarantee as a hand-built DFA and does “no backtracking” isn’t accurate with java.util.regex.Pattern (which is primarily a backtracking engine and can be super-linear for some regexes). Please soften/correct this to avoid promising DFA-like worst-case performance unless Alpaca enforces a safe regex subset.

Suggested change:

> This means Alpaca's lexer processes the input in a single left-to-right pass using a combined
> regex. For typical lexer-style patterns this behaves similarly to a DFA-based lexer, but the
> actual worst-case performance is determined by `java.util.regex.Pattern` and the specific
> regexes you use, which may involve backtracking.
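The single-pass, combined-regex behavior described in this comment can be sketched as a small scanner loop. This is an illustrative assumption, not Alpaca's implementation: `Tok`, `combined`, and `scan` are hypothetical names, and the cost of each match attempt still depends on how `java.util.regex` handles the individual patterns.

```scala
import java.util.regex.Pattern

// Hypothetical token record — not Alpaca's Lexeme type.
final case class Tok(name: String, text: String)

// One named group per token class, joined into a single alternation.
val combined: Pattern = Pattern.compile("(?<NUMBER>[0-9]+)|(?<PLUS>\\+)|(?<WS>\\s+)")

// One left-to-right pass: each step matches at the current offset and advances
// past the match, so no position is rescanned by this loop. Per-step cost is
// whatever java.util.regex needs for the patterns involved (may backtrack).
def scan(input: String): List[Tok] = {
  val m = combined.matcher(input)
  val out = List.newBuilder[Tok]
  var pos = 0
  while (pos < input.length) {
    m.region(pos, input.length)
    if (m.lookingAt()) { // the match must start exactly at `pos`
      if (m.group("NUMBER") != null) out += Tok("NUMBER", m.group("NUMBER"))
      else if (m.group("PLUS") != null) out += Tok("PLUS", m.group("PLUS"))
      // whitespace (WS) is matched but produces no token
      pos = m.end()
    } else {
      throw new IllegalArgumentException(s"no token matches at offset $pos")
    }
  }
  out.result()
}
```

The loop itself makes exactly one forward pass; any super-linear behavior would come from the regex engine inside a single `lookingAt` call, which is the distinction the suggested wording draws.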
> |-----------|-------------------|-------------|-------|
> | LR(0) | None (reduce always) | Smallest | Too weak for most real grammars |
> | SLR(1) | FOLLOW sets (global per non-terminal) | Same as LR(0) | Better, still limited |
> | LALR(1) | Per-state lookahead (merged item-set cores) | Same as LR(0)/SLR | Most common in practice (yacc, Bison, ANTLR) |

ANTLR is listed here as an LALR(1) parser generator, but ANTLR is a top-down LL(*)-style parser generator (not LALR(1)). Please remove ANTLR from this example list or replace it with an actual LALR(1) tool.

Suggested change:

> | LALR(1) | Per-state lookahead (merged item-set cores) | Same as LR(0)/SLR | Most common in practice (yacc, Bison) |
> position in the source. Parsing `"3 + 4"` produces three lexemes:
>
> - `NUMBER("3", pos=0)`
> - `PLUS("+", pos=2)`
> - `NUMBER("4", pos=4)`
>
> The word *lexeme* is used throughout this documentation to mean this complete record.
>
> > **Definition — Lexeme:**
> > A *lexeme* is a triple (T, w, pos) where T is a token class, w ∈ L(T) is the matched string
> > (a member of the language defined by T's regex), and pos is the position of the end of the
> > match in the source text.

These example lexemes use pos as a 0-based absolute offset (pos=0/2/4). Alpaca’s built-in LexerCtx.Default instead tracks position as a 1-based column position (and lexeme fields depend on the chosen context). Please adjust the example field name/values or label it as a conceptual example rather than Alpaca’s concrete fields.

Suggested change:

> position in the source. With Alpaca’s default `LexerCtx.Default`, parsing `"3 + 4"` produces
> three lexemes whose `position` field is the 1-based column at the end of each token:
>
> - `NUMBER("3", position=1)`
> - `PLUS("+", position=3)`
> - `NUMBER("4", position=5)`
>
> The word *lexeme* is used throughout this documentation to mean this complete record.
>
> > **Definition — Lexeme (conceptual):**
> > Conceptually, a *lexeme* is a triple (T, w, position) where T is a token class, w ∈ L(T) is
> > the matched string (a member of the language defined by T's regex), and `position` is the
> > location of the end of the match in the source text, according to whatever coordinate system
> > the lexer context uses (for example, column within the current line).
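The 1-based end-column convention used in the suggested example values can be made concrete with a tiny hand-rolled tokenizer. This is purely illustrative: `Lexeme3` and `tokenize3` are invented names and do not correspond to Alpaca's `Lexeme` or `LexerCtx` API.

```scala
// Conceptual (T, w, position) triple — invented type, not Alpaca's Lexeme.
final case class Lexeme3(tokenClass: String, matched: String, position: Int)

// Tokenize a single-line input, reporting `position` as the 1-based column of
// the END of each match (the convention attributed to LexerCtx.Default above).
def tokenize3(line: String): List[Lexeme3] = {
  val out = List.newBuilder[Lexeme3]
  var i = 0
  while (i < line.length) {
    val c = line(i)
    if (c.isDigit) {
      val start = i
      while (i < line.length && line(i).isDigit) i += 1
      // after the loop, i is the index past the last digit, which equals the
      // 1-based column of the last matched character
      out += Lexeme3("NUMBER", line.substring(start, i), i)
    } else if (c == '+') {
      i += 1
      out += Lexeme3("PLUS", "+", i)
    } else if (c == ' ') {
      i += 1 // whitespace produces no lexeme
    } else {
      throw new IllegalArgumentException(s"unexpected '$c' at column ${i + 1}")
    }
  }
  out.result()
}
```

For `"3 + 4"` this yields end columns 1, 3, and 5 — the same values as in the suggested replacement text.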
> - `name` — the token class name string, e.g., `"NUMBER"` or `"PLUS"`
> - `value` — the extracted value with its Scala type, e.g., `3.14: Double` for NUMBER, `(): Unit`
>   for PLUS
> - `position` — the character offset at the end of the match
> - `line` — the line number at the end of the match

These bullets describe position as a “character offset at the end of the match”. In Alpaca’s default lexer context, position is a 1-based column position within the current line (and lexeme fields depend on the chosen LexerCtx). Please adjust this wording so it matches the actual position semantics and doesn’t imply a fixed, always-present absolute offset.
> The consequence: if your regex is invalid, or your grammar is ambiguous, you get a compile error — not a runtime crash. The pipeline is safe by construction before it ever runs on real input.
>
> Alpaca covers stages 1–3 of the classical pipeline. The "code generation" stage is not part of the library — your Scala semantic actions in the parser rules produce the final typed value directly.

This page defines a 4-stage pipeline and says CalcParser.parse handles stages 3–4, but later says “Alpaca covers stages 1–3 of the classical pipeline.” That’s contradictory given the earlier definition. Please make the stage numbering consistent (either Alpaca covers 1–4 in this tutorial framing, or redefine which stages Alpaca implements).

Suggested change:

> Alpaca covers stages 1–4 of the classical pipeline. The "code generation" stage is not part of the library — your Scala semantic actions in the parser rules produce the final typed value directly.
> > A *lexeme* is a triple (T, w, pos) where T is a token class, w ∈ L(T) is the matched string
> > (a member of the language defined by T's regex), and pos is the position of the end of the
> > match in the source text.
> > In Alpaca: `Lexeme[Name, Value]` where `Name` is the token class name (a string literal type)
> > and `Value` is the Scala type of the extracted value.

The definition describes pos as “the position of the end of the match in the source text.” In Alpaca, position/line come from the lexer context (e.g., PositionTracking/LineTracking) and are column/line oriented rather than a single absolute offset, and custom contexts can change what’s tracked. Please refine this definition so it matches Alpaca’s actual position model (or clearly label it as a language-theory definition separate from Alpaca’s fields).
> A lexer reads a character stream from left to right and emits a token stream. Each scan step
> finds the longest prefix of the remaining input that matches one of the token class patterns —
> this is the *maximal munch* rule. When no pattern matches the current position, the lexer throws
> an error. The result is a flat list of lexemes that the parser consumes next.

This page says the lexer uses the “maximal munch” (longest-prefix) rule across token classes. Alpaca’s implementation uses a single ordered Java-regex alternation and effectively picks the first alternative that matches at the current position, so it is not guaranteed to choose the longest match among all token patterns. Please reword this section to describe the actual ordered “first match wins” behavior (and how to order rules to get the desired result).

Suggested change:

> A lexer reads a character stream from left to right and emits a token stream. At each scan step,
> it tries the token class patterns in their declared order and picks the first pattern whose regex
> matches a prefix of the remaining input — this is an ordered “first match wins” rule. To prefer
> longer or more specific tokens over shorter prefixes, place their rules earlier in the list. When
> no pattern matches the current position, the lexer throws an error. The result is a flat list of
> lexemes that the parser consumes next.
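The ordered “first match wins” behavior of a Java-regex alternation can be demonstrated directly. The property shown is standard `java.util.regex` alternation semantics; `firstMatch` itself is a hypothetical helper, not part of Alpaca.

```scala
import java.util.regex.Pattern

// Java regex alternation tries branches left to right and commits to the first
// branch that matches at the current position — it does not prefer the longest.
def firstMatch(pattern: String, input: String): String = {
  val m = Pattern.compile(pattern).matcher(input)
  if (m.lookingAt()) m.group()
  else throw new IllegalArgumentException(s"no match for $pattern at start of $input")
}
```

Against the input `==`, the pattern `=|==` matches only `=`, while `==|=` matches `==` — which is why a lexer built on such an alternation must declare longer or more specific rules first.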
📊 Test Compilation Benchmark

Result: Current branch is 0.620s slower (1.24%)
Summary
🤖 Generated with Claude Code