Complete v1.2: Cookbook pages and tech debt cleanup #266

halotukozak wants to merge 16 commits into grammar-theory from
Conversation
- Formal definition block for Parse Table Conflict (state/symbol pair collision)
- Shift/reduce conflict section with CalcParser 1+2+3 example and real Alpaca error message
- alwaysBefore/alwaysAfter discrepancy note immediately after error block
- Reduce/reduce conflict section with Integer/Float example and error message
- LR(1) lookahead disambiguation section
- Resolution by priority section with minimal sc:nocompile example (production.plus only)
- Compile-time detection section with standard blockquote callout
- Cross-links to cfg.md, shift-reduce.md, ../conflict-resolution.md, semantic-actions.md, full-example.md

- Six-step narrative from bare grammar to working calculator (7.0)
- CalcLexer definition, bare CalcParser with ShiftReduceConflict error
- Resolved CalcParser with all 6 resolutions using production.div (not production.divide)
- Pipeline evaluation: 1+2*3=7.0, (1+2)*3=9.0 with null-check note
- Semantic action trace for 1+2*3 showing 2*3 reduces before 1+...
- Formal definition block, compile-time callout blockquote
- Theory-to-code mapping table with cross-links to all theory pages

- Formal definition block for Semantic Action (S-attributed scheme)
- Syntax-directed translation section with synthesized attribute explanation
- Extractor pattern section with complete 7-production CalcParser action table
- No-parse-tree section grounded in Parser.scala loop() implementation
- Typed results section explaining Rule[Double] compile-time type checking
- Compile-time processing callout
- Cross-links to shift-reduce.md, conflicts.md, ../extractors.md, ../parser.md, full-example.md
- No Rule[Int], no n.value.toDouble, no inherited attribute, no L-attributed

- Add 'Compiler Theory' subsection with 9 theory pages in pipeline order
- Pages use theory/pagename.md format resolving to docs/_docs/theory/
- Order: pipeline, tokens, lexer-fa, cfg, why-lr, shift-reduce, conflicts, semantic-actions, full-example

- pipeline.md: tokens.md and lexer-fa.md sibling links no longer use theory/ prefix
- pipeline.md: lexer.md and parser.md reference doc links now use ../ prefix
- tokens.md: cfg.md sibling link no longer uses theory/ prefix
- lexer-fa.md: cfg.md sibling link no longer uses theory/ prefix

…duce section
- Inserted identical correction blockquote after the reduce/reduce compiler output block
- Readers who encounter only the RR error message now learn that alwaysBefore/alwaysAfter do not exist in the Alpaca API
- Correct methods are before/after per conflict-resolution.md

…y pages
- semantic-actions.md: replace backtick code span with functional [Parser](../parser.md) hyperlink
- shift-reduce.md: add Next: [Conflicts and Disambiguation](conflicts.md) bullet to Cross-links
- tokens.md: add Next: [The Lexer: Regex to Finite Automata](lexer-fa.md) bullet to Cross-links

- Line 102: n.value: Int -> n.value: Double (CalcLexer.NUMBER yields Double)
- Line 117: where an Int -> where a Double (matching type)
- Line 245: Rule[Int] -> Rule[Double] in conflict-resolution example

- Append See [Debug Settings](debug-settings.html) paragraph at end of lexer.md
- Append See [Debug Settings](debug-settings.html) paragraph at end of parser.md

- Fix [cfg.md](cfg.md) to [Context-Free Grammars](cfg.md) on line 24 (TD-05)
- Add Next: prefix to Semantic Actions bullet in conflicts.md Cross-links (TD-04)
- Add Next: prefix to Full Example bullet in semantic-actions.md Cross-links (TD-04)

- Line 22: n.value: Int -> n.value: Double (CalcLexer.NUMBER yields Double)
- Line 33: where an Int -> where a Double (matching type)
- Line 62: Rule[Int] -> Rule[Double] (CalcLexer.NUMBER binding)
- Line 67: v: Int -> v: Double (matching type annotation in comment)
- Lines 28-29: single-backslash backtick names (`\+`, `\(`, `\)`) -> double-backslash (`\\+`, `\\(`, `\\)`) to match parser.md and lexer.md Naming Table style

- Explains tokenize -> filter List[Lexeme] -> parse composition pattern
- Comment-stripping example with Stage1 lexer and SumParser
- Re-lexing values example with flatMap expansion pattern
- Documents that Lexeme constructor is private[alpaca]
- Cross-links to between-stages.html, lexer.html, parser.html

- Complete CalcLexer + CalcParser with operator precedence via before/after DSL
- Rule[Double] type with n.value extractor pattern (decision [13-01])
- Full resolutions set covering +, -, *, / with correct precedence hierarchy
- Key points section warns against alwaysBefore/alwaysAfter (decision [10-01])
- Cross-links to conflict-resolution.html, parser.html, lexer.html

- IndentCtx case class with currentIndent and prevIndent fields
- IndentLexer with \n( *) pattern and body-condition workaround for guards
- INDENT/DEDENT token emission based on indentation level change
- IndentParser example reading INDENT/DEDENT tokens
- Cross-links to lexer-context.html, lexer-error-recovery.html, lexer.html

- Three sections: ShadowException (compile-time), RuntimeException (runtime lex), T | Null (parser failure)
- Clarifies ShadowException is compile-time only -- cannot be caught with try/catch
- Guards-not-supported workaround pattern included
- Notes GH #21 (no custom error handler) and GH #51/#65 (no structured parser errors)
- Cross-links to lexer-error-recovery.html, lexer-context.html, parser.html
Adds a "Cookbook" nested section to sidebar.yml listing all 4 how-to pages: expression-evaluator, error-messages, multi-pass, whitespace-sensitive. Build verified: ./mill docJar 63/63 SUCCESS. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Minimum allowed coverage is

Generated by 🐒 cobertura-action against c706729
Pull request overview
Adds a new “Cookbook” section and expands the documentation set with both cookbook/how-to guides and a multi-page compiler-theory tutorial, alongside several consistency/tech-debt cleanups across existing reference docs and navigation.
Changes:
- Added 4 cookbook pages (expression evaluator, error messages, multi-pass processing, whitespace-sensitive lexing) and integrated them into the sidebar.
- Added/expanded compiler theory tutorial pages (pipeline, tokens/lexemes, lexer FA, CFGs, LR motivation, shift-reduce, conflicts, semantic actions, full example).
- Refined reference docs with cross-links and additional guidance (e.g., debug timeout section, expanded docs link list).
Reviewed changes
Copilot reviewed 24 out of 24 changed files in this pull request and generated 11 comments.
| File | Description |
|---|---|
| docs/sidebar.yml | Adds nested “Cookbook” section and links theory/tutorial pages into nav. |
| docs/_docs/getting-started.md | Adds a richer “Documentation” link list with direct page links. |
| docs/_docs/lexer.md | Reference doc for lexer DSL, regex patterns, tokens, naming rules, context intro. |
| docs/_docs/lexer-context.md | Documents LexerCtx contract, tracking traits, snapshots, and hooks. |
| docs/_docs/lexer-error-recovery.md | Documents compile-time lexer errors and runtime failure behavior. |
| docs/_docs/between-stages.md | Explains Lexeme/snapshot contract and BetweenStages behavior. |
| docs/_docs/parser.md | Reference doc for parser DSL, terminals/non-terminals, EBNF operators, conflicts. |
| docs/_docs/parser-context.md | Documents ParserCtx usage and context threading through reductions. |
| docs/_docs/extractors.md | Reference doc for terminal/non-terminal/EBNF extractors and lexeme fields. |
| docs/_docs/conflict-resolution.md | Reference/tutorial for conflict messages and before/after resolution DSL. |
| docs/_docs/debug-settings.md | Adds compile-time debug settings explanation and timeout troubleshooting. |
| docs/_docs/theory/pipeline.md | Introduces pipeline model and compile-time vs runtime boundary. |
| docs/_docs/theory/tokens.md | Defines tokens/lexemes and maps them to Alpaca types and examples. |
| docs/_docs/theory/lexer-fa.md | Explains regex → FA concepts and how Alpaca combines token patterns. |
| docs/_docs/theory/cfg.md | Introduces CFGs, derivations, parse trees, and DSL mapping. |
| docs/_docs/theory/why-lr.md | Motivates LR parsing vs LL and explains LR family choice. |
| docs/_docs/theory/shift-reduce.md | Step-by-step shift/reduce trace with LR(1) lookahead discussion. |
| docs/_docs/theory/conflicts.md | Explains shift/reduce and reduce/reduce conflicts and resolution. |
| docs/_docs/theory/semantic-actions.md | Explains semantic actions, extractor patterns, and typed results. |
| docs/_docs/theory/full-example.md | Assembles full calculator with conflict resolution and run-through. |
| docs/_docs/cookbook/expression-evaluator.md | How-to for precedence/associativity with before/after in a calculator. |
| docs/_docs/cookbook/error-messages.md | How-to for understanding compile-time vs lex-time vs parse-time failures. |
| docs/_docs/cookbook/multi-pass.md | How-to for composing multiple passes via lexeme list transformations. |
| docs/_docs/cookbook/whitespace-sensitive.md | How-to for indentation tracking and emitting INDENT/DEDENT tokens. |
> `// result: Double | Null = 11.0`
>
> `CalcLexer.tokenize` handles stages 1–2: it takes the source string and produces a `List[Lexeme]`. `CalcParser.parse` handles stages 3–4: it takes those lexemes, builds the parse tree internally, and returns the typed result.
The description says CalcParser.parse “builds the parse tree internally”, but later pages explicitly state that Alpaca does not materialize a parse tree object and instead applies semantic actions immediately during reductions. Consider rephrasing to avoid implying a tree is constructed (e.g., “recognizes structure via the LR stack and evaluates semantic actions during reductions”).
```diff
- `CalcLexer.tokenize` handles stages 1–2: it takes the source string and produces a `List[Lexeme]`. `CalcParser.parse` handles stages 3–4: it takes those lexemes, builds the parse tree internally, and returns the typed result.
+ `CalcLexer.tokenize` handles stages 1–2: it takes the source string and produces a `List[Lexeme]`. `CalcParser.parse` handles stages 3–4: it takes those lexemes, recognizes the grammatical structure using the LR parse table, applies your semantic actions during reductions, and returns the typed result.
```
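The reduce-time evaluation the suggestion describes is a general LR technique. A minimal plain-Java sketch can make it concrete (this is not Alpaca's `Parser.scala` implementation; the value stack and action name here are invented for illustration): the semantic action runs at the moment of reduction, so no tree node is ever allocated.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Sketch: an LR-style reduction applies the semantic action immediately,
// pushing the computed value onto a value stack. No parse-tree node exists.
public class ReduceDemo {
    // The value stack mirrors the parser's state stack (illustrative only).
    static final Deque<Double> values = new ArrayDeque<>();

    // Hypothetical "action" for the production Expr -> Expr '+' Expr.
    static void reducePlus() {
        double b = values.pop();
        double a = values.pop();
        values.push(a + b); // semantic action runs at reduce time
    }

    public static void main(String[] args) {
        // Simulate parsing "1 + 2": shift 1, shift '+', shift 2, reduce.
        values.push(1.0);
        values.push(2.0);
        reducePlus();
        System.out.println(values.peek()); // 3.0
    }
}
```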
> A lexer reads a character stream from left to right and emits a token stream. Each scan step finds the longest prefix of the remaining input that matches one of the token class patterns — this is the *maximal munch* rule. When no pattern matches the current position, the lexer throws an error. The result is a flat list of lexemes that the parser consumes next.
This section describes tokenization as “find[ing] the longest prefix … that matches one of the token class patterns (maximal munch)”, but Alpaca’s lexer semantics are documented elsewhere as ordered patterns where the first matching pattern wins. As implemented (combined alternation regex + lookingAt), the match choice is priority-based by pattern order, not a global longest-match across all token patterns. Consider adjusting this wording to match the actual “ordered rules / first match wins” behavior.
```diff
- A lexer reads a character stream from left to right and emits a token stream. Each scan step
- finds the longest prefix of the remaining input that matches one of the token class patterns —
- this is the *maximal munch* rule. When no pattern matches the current position, the lexer throws
- an error. The result is a flat list of lexemes that the parser consumes next.
+ A lexer reads a character stream from left to right and emits a token stream. At each scan step,
+ it tries the token class patterns in a fixed order and selects the first pattern whose regex
+ matches a prefix of the remaining input. If no pattern matches the current position, the lexer
+ throws an error. The result is a flat list of lexemes that the parser consumes next.
```
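The "first pattern wins" behavior can be demonstrated directly against `java.util.regex`, the engine the review references. The token classes below (a keyword vs. an identifier) are hypothetical, chosen only to show that an earlier alternative beats a longer later match:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch of ordered-alternation tokenization: the FIRST alternative that
// matches at the current position wins, even when a later alternative
// would match a longer prefix.
public class FirstMatchDemo {
    // Alternative 1: the keyword "if"; alternative 2: identifiers.
    static final Pattern TOKENS = Pattern.compile("(if)|([a-z]+)");

    static String classify(String input) {
        Matcher m = TOKENS.matcher(input);
        if (!m.lookingAt()) throw new IllegalStateException("no match");
        // group(1) is non-null only when the first alternative matched.
        return m.group(1) != null ? "KEYWORD:" + m.group(0)
                                  : "IDENT:" + m.group(0);
    }

    public static void main(String[] args) {
        // "ifx": the first alternative matches the prefix "if", so the
        // result is KEYWORD:if, not the longer identifier "ifx".
        System.out.println(classify("ifx")); // KEYWORD:if
        System.out.println(classify("x"));   // IDENT:x
    }
}
```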
```scala
"minus" { case (Expr(a), CalcLexer.MINUS(_), Expr(b)) => a - b },
"times" { case (Expr(a), CalcLexer.TIMES(_), Expr(b)) => a * b },
"div"   { case (Expr(a), CalcLexer.DIVIDE(_), Expr(b)) => a / b },
{ case (CalcLexer.`\(`(_), Expr(e), CalcLexer.`\)`(_)) => e },
```
In the parser example, the parenthesis token accessors are elsewhere written with double-backslash backticked names (`` CalcLexer.`\\(` `` and `` CalcLexer.`\\)` ``), but here they appear with only a single backslash (`` CalcLexer.`\(` `` / `` CalcLexer.`\)` ``). That token name won't match the lexer definition above (`case "\\(" => Token["LPAREN"]`, etc.) and conflicts with the accessor form documented elsewhere. Update the snippet to use the correct backticked accessor names for LPAREN/RPAREN, or use `CalcLexer.LPAREN`/`CalcLexer.RPAREN` consistently.
```diff
- { case (CalcLexer.`\(`(_), Expr(e), CalcLexer.`\)`(_)) => e },
+ { case (CalcLexer.LPAREN(_), Expr(e), CalcLexer.RPAREN(_)) => e },
```
```scala
case "\\n( *)" =>
  val newIndent = ctx.text.toString.count(_ == ' ')
  val prev = ctx.prevIndent
```
ctx.text is documented elsewhere as the remaining input before the match, not the matched substring (see lexer-context/between-stages docs and Tokenization.tokenize implementation). Counting spaces via ctx.text.toString.count(_ == ' ') will therefore count spaces beyond the newline/indent segment and compute the wrong indent. Bind the match with @ (e.g., case m @ "\\n( *)" =>) and count spaces in m, or otherwise derive the indent from just the matched text.
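The fix the reviewer suggests, counting spaces in the matched segment only, can be sketched with plain `java.util.regex` (an illustration of the regex mechanics, not the Alpaca DSL; `indentAt` is a hypothetical helper):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Derive the new indent level from the MATCHED "\n( *)" segment
// (capture group 1), never from the whole remaining input.
public class IndentDemo {
    static final Pattern INDENT = Pattern.compile("\\n( *)");

    static int indentAt(String remaining) {
        Matcher m = INDENT.matcher(remaining);
        if (!m.lookingAt()) return -1; // rule does not apply here
        // Count spaces in the captured group, i.e. the matched text only.
        return m.group(1).length();
    }

    public static void main(String[] args) {
        // The remaining input has 4 spaces of indent; later spaces on the
        // line are ignored because only the matched segment is inspected.
        System.out.println(indentAt("\n    x = 1  # trailing spaces")); // 4
    }
}
```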
> The `\\n( *)` pattern matches a newline followed by zero or more spaces. `ctx.text` contains the full match text at the time the rule body runs; counting spaces in it gives the new indentation level. `Token["INDENT"](newIndent)` and `Token["DEDENT"](newIndent)` carry the new depth as their value, which the parser can read.
The explanation says ctx.text contains the full match text in the lexer rule body, but LexerCtx.text is described as the remaining input before each match (with lexeme snapshots overriding text later). This mismatch is likely to confuse readers and also contradicts the code sample’s intent. Consider rewording to clarify that you should use a bound match string (via @) for the matched indentation segment, while ctx.text is the remaining input.
```scala
object SumParser extends Parser:
  val Sum: Rule[Int] = rule(
    { case (Sum(a), Stage1.PLUS(_), Sum(b)) => a + b },
    { case Stage1.NUM(n) => n.value.asInstanceOf[Int] },
```
Stage1.NUM(n) binds n as a lexeme whose .value should already be Int (from Token["NUM"](num.toInt)), so n.value.asInstanceOf[Int] is misleading and unnecessary. Prefer n.value to keep the example type-safe and consistent with the rest of the docs.
```diff
- { case Stage1.NUM(n) => n.value.asInstanceOf[Int] },
+ { case Stage1.NUM(n) => n.value },
```
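The reason the cast is unnecessary can be shown with a tiny generic-token sketch. `Token<T>` here is a hypothetical stand-in, not Alpaca's type, but it illustrates the point: when the token type carries its value type, reading the value is statically typed and needs no cast.

```java
// Hypothetical value-typed token: the type parameter records the value
// type, so callers read it without any runtime cast.
final class Token<T> {
    final T value;
    Token(T value) { this.value = value; }
}

public class TypedTokenDemo {
    public static void main(String[] args) {
        Token<Integer> num = new Token<>(42);
        int v = num.value;         // no cast: statically known to be Integer
        System.out.println(v + 1); // 43
    }
}
```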
```scala
  names: mutable.Map[String, Int] = mutable.Map.empty,
) extends ParserCtx derives Copyable

object CalcParser extends Parser[CalcContext]:
  val Expr: Rule[Int] = rule(
    { case CalcLexer.NUMBER(n) => n.value },
    { case CalcLexer.ID(id) => ctx.names.getOrElse(id.value, 0) },
  )
  val Statement: Rule[Unit | Int] = rule(
    { case (CalcLexer.ID(id), CalcLexer.ASSIGN(_), Expr(expr)) =>
        ctx.names(id.value) = expr },
    { case Expr(expr) => expr },
  )
  val root: Rule[Unit | Int] = rule:
```
This stateful parser example declares val Expr: Rule[Int] but returns n.value from CalcLexer.NUMBER(n). In the docs’ running CalcLexer, NUMBER is defined as num.toDouble (i.e., n.value: Double), so this example’s types don’t line up and would not typecheck if compiled. Either change the rule/result types to Double (and update names/defaults accordingly) or introduce a separate lexer example where NUMBER is an Int.
```diff
-   names: mutable.Map[String, Int] = mutable.Map.empty,
- ) extends ParserCtx derives Copyable
- object CalcParser extends Parser[CalcContext]:
-   val Expr: Rule[Int] = rule(
-     { case CalcLexer.NUMBER(n) => n.value },
-     { case CalcLexer.ID(id) => ctx.names.getOrElse(id.value, 0) },
-   )
-   val Statement: Rule[Unit | Int] = rule(
-     { case (CalcLexer.ID(id), CalcLexer.ASSIGN(_), Expr(expr)) =>
-         ctx.names(id.value) = expr },
-     { case Expr(expr) => expr },
-   )
-   val root: Rule[Unit | Int] = rule:
+   names: mutable.Map[String, Double] = mutable.Map.empty,
+ ) extends ParserCtx derives Copyable
+ object CalcParser extends Parser[CalcContext]:
+   val Expr: Rule[Double] = rule(
+     { case CalcLexer.NUMBER(n) => n.value },
+     { case CalcLexer.ID(id) => ctx.names.getOrElse(id.value, 0.0) },
+   )
+   val Statement: Rule[Unit | Double] = rule(
+     { case (CalcLexer.ID(id), CalcLexer.ASSIGN(_), Expr(expr)) =>
+         ctx.names(id.value) = expr },
+     { case Expr(expr) => expr },
+   )
+   val root: Rule[Unit | Double] = rule:
```
```scala
object CalcParser extends Parser: // uses ParserCtx.Empty by default
  val Expr: Rule[Int] = rule(
    { case (Expr(a), CalcLexer.PLUS(_), Expr(b)) => a + b },
    { case CalcLexer.NUMBER(n) => n.value },
  )
```
These examples use Rule[Int] with CalcLexer.NUMBER(n) => n.value, but elsewhere in the docs the running CalcLexer defines NUMBER as Double (num.toDouble). As written, the types are inconsistent and the snippets would not typecheck if compiled. Consider switching these examples to Double throughout or explicitly defining a separate lexer whose NUMBER token carries Int.
> This means Alpaca's lexer runs with the same O(n) guarantee as a hand-built DFA: one pass through the input, no backtracking.
The claim that using Java’s regex engine implies “the same O(n) guarantee as a hand-built DFA” / “no backtracking” is not accurate in general: java.util.regex is a backtracking engine and can exhibit superlinear behavior for some patterns. Since Alpaca relies on Pattern + Matcher.lookingAt, it’s safer to avoid stating a strict O(n) guarantee unless the implementation enforces RE2-style constraints or similar. Consider rephrasing to a weaker guarantee (e.g., typically linear for these token patterns) or documenting the limitation.
```diff
- This means Alpaca's lexer runs with the same O(n) guarantee as a hand-built DFA: one pass
- through the input, no backtracking.
+ In practice, this means Alpaca's lexer behaves much like a hand-built DFA: it makes a single
+ pass over the input using a combined pattern and, for typical token patterns, runs in time
+ that is effectively linear in the input size. However, because it relies on Java's backtracking
+ `java.util.regex` engine, it does not provide a formal worst-case O(n) guarantee for arbitrary
+ regular expressions.
```
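The backtracking behavior the suggestion warns about can be reproduced with a deliberately pathological pattern. The pattern below is a textbook worst case, not one from Alpaca's docs; the inputs are kept tiny so the demo stays fast, but each additional `a` roughly doubles the work on the failing match:

```java
import java.util.regex.Pattern;

// java.util.regex is a backtracking engine: "(a|a)*b" explores
// exponentially many alternative splits when the match FAILS,
// so a blanket O(n) claim does not hold for arbitrary patterns.
public class BacktrackDemo {
    static final Pattern EVIL = Pattern.compile("(a|a)*b");

    static boolean matches(String s) {
        return EVIL.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(matches("aaab"));            // true, fast
        // On failure the engine retries every split of the a's
        // (2^15 paths for 15 a's -- still fast, but exponential):
        System.out.println(matches("aaaaaaaaaaaaaaa")); // false
    }
}
```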
```scala
object CalcParser extends Parser[CalcContext]:
  val Expr: Rule[Int] = rule(
    { case CalcLexer.NUMBER(n) => n.value },
    { case CalcLexer.ID(id) =>
```
Same issue as earlier in this file: val Expr: Rule[Int] returns n.value from CalcLexer.NUMBER, but the running CalcLexer in the docs yields Double. Keeping NUMBER consistently Double across docs would avoid confusing readers and prevent copy/paste type errors.
📊 Test Compilation Benchmark

Result: Current branch is 0.149s, unchanged (0.34%) ℹ️
Summary

`./mill docJar` 63/63 SUCCESS

What's included

Cookbook pages (Phase 14)
- `docs/_docs/cookbook/expression-evaluator.md` — Full CalcParser with operator precedence via before/after DSL
- `docs/_docs/cookbook/error-messages.md` — ShadowException, RuntimeException, and null parse result handling
- `docs/_docs/cookbook/multi-pass.md` — Chaining lexer/parser passes, filtering lexeme streams
- `docs/_docs/cookbook/whitespace-sensitive.md` — LexerCtx-based INDENT/DEDENT indentation tracking

Tech debt fixes (Phase 13)
- `parser.md` and `extractors.md` type comments (`Int` → `Double` for CalcLexer.NUMBER)
- `debug-settings.md` cross-links from lexer.md and parser.md
- `extractors.md` backtick notation inconsistency
- `Next:` prefix bullets added to `conflicts.md` and `semantic-actions.md`
- `conflicts.md` link text changed from `[cfg.md](cfg.md)` to `[Context-Free Grammars](cfg.md)`

Sidebar integration (Phase 15)
- `sidebar.yml` with all 4 how-to pages

Test plan
- `./mill docJar` passes (63/63 SUCCESS)
- `sc:nocompile` annotation (11 code blocks)
- `import alpaca.*`

🤖 Generated with Claude Code