docs: compiler theory section with gap fixes #309

halotukozak wants to merge 14 commits into master
Conversation
- Create docs/_docs/theory/tokens.md (TH-02)
  - Terminal symbols definition, token class vs instance distinction
  - Formal lexeme definition as triple (T, w, pos) for CROSS-02
  - CalcLexer 7-token class table with patterns and value types
  - Canonical CalcLexer definition with sc:nocompile
  - Tokenization output code example with sc:nocompile
  - Cross-links to lexer.md and lexer-fa.md
- New docs/_docs/theory/ directory with pipeline.md as opening theory page
  - Explains four compilation stages: source text, lexical analysis, syntactic analysis, semantic analysis
  - Documents Alpaca's compile-time vs runtime boundary with standard callout block
  - Formal definition block using Unicode math: parse ∘ tokenize : String → R
  - Pipeline code example with sc:nocompile referencing CalcLexer and CalcParser
  - Type mapping table: pipeline stages to Alpaca types (List[Lexeme], R | Null)
  - Cross-links to lexer.md and parser.md
  - No CalcParser grammar notation, no LaTeX, all macro blocks marked sc:nocompile
- Create docs/_docs/theory/lexer-fa.md (TH-03)
  - Regular language formal definition block for CROSS-02
  - NFA/DFA conceptual explanation with state transition table for PLUS token
  - DFA 5-tuple formal definition (Q, Σ, δ, q₀, F) for CROSS-02
  - Combined alternation pattern explanation grounded in Alpaca internals
  - Shadowing detection via dregex subset checking
  - Standard compile-time callout block
  - Cross-links to lexer.md and tokens.md
- …finition
  - Top-down vs bottom-up parsing approaches
  - Left recursion infinite-loop trace showing LL failure
  - LR family comparison table: LR(0), SLR(1), LALR(1), LR(1) with Alpaca marked as LR(1)
  - Why LR(1) vs LALR(1) section grounded in Item.scala/ParseTable.scala source
  - LR(1) item formal definition using [A → α • β, a] dot notation with examples
  - O(n) parsing paragraph
  - Compile-time callout in established blockquote format
  - Cross-links to cfg.md, shift-reduce.md, ../conflict-resolution.md, ../parser.md
- …rmal configuration
  - Parse stack explanation: (stateIndex, node) pairs from Parser.scala
  - Parse tables section: parse table + action table with separation of concerns
  - Simplified 3-production grammar block for trace clarity
  - 8-row parse trace table for '1 + 2' with Stack | Remaining input | Action columns
  - Annotation notes for steps 1, 2, 6, 7, 8
  - Disclaimer that state numbers are illustrative for simplified grammar
  - 3 LR(1) item examples with dot notation from Item.scala
  - LR parse configuration formal definition in blockquote format
  - Connection to Alpaca runtime loop() function prose reference
  - O(n) loop termination paragraph
  - Compile-time callout in established blockquote format
  - Cross-links to why-lr.md, cfg.md, ../conflict-resolution.md, ../parser.md, pipeline.md
- Formal CFG 4-tuple definition (V, Σ, R, S) in blockquote format
  - 7-production CalcParser BNF grammar (6 Expr productions + root)
  - Leftmost derivation for 1 + 2 with ⇒ steps
  - ASCII parse tree for 1 + 2
  - CalcParser Alpaca DSL block annotated with sc:nocompile
  - Compile-time callout in established blockquote format
  - Cross-links to tokens.md, why-lr.md, ../parser.md, ../conflict-resolution.md
- Formal definition block for Parse Table Conflict (state/symbol pair collision)
  - Shift/reduce conflict section with CalcParser 1+2+3 example and real Alpaca error message
  - alwaysBefore/alwaysAfter discrepancy note immediately after error block
  - Reduce/reduce conflict section with Integer/Float example and error message
  - LR(1) lookahead disambiguation section
  - Resolution by priority section with minimal sc:nocompile example (production.plus only)
  - Compile-time detection section with standard blockquote callout
  - Cross-links to cfg.md, shift-reduce.md, ../conflict-resolution.md, semantic-actions.md, full-example.md
- Six-step narrative from bare grammar to working calculator (7.0)
  - CalcLexer definition, bare CalcParser with ShiftReduceConflict error
  - Resolved CalcParser with all 6 resolutions using production.div (not production.divide)
  - Pipeline evaluation: 1+2*3=7.0, (1+2)*3=9.0 with null-check note
  - Semantic action trace for 1+2*3 showing 2*3 reduces before 1+...
  - Formal definition block, compile-time callout blockquote
  - Theory-to-code mapping table with cross-links to all theory pages
- Formal definition block for Semantic Action (S-attributed scheme)
  - Syntax-directed translation section with synthesized attribute explanation
  - Extractor pattern section with complete 7-production CalcParser action table
  - No-parse-tree section grounded in Parser.scala loop() implementation
  - Typed results section explaining Rule[Double] compile-time type checking
  - Compile-time processing callout
  - Cross-links to shift-reduce.md, conflicts.md, ../extractors.md, ../parser.md, full-example.md
  - No Rule[Int], no n.value.toDouble, no inherited attribute, no L-attributed
- Add 'Compiler Theory' subsection with 9 theory pages in pipeline order
  - Pages use theory/pagename.md format resolving to docs/_docs/theory/
  - Order: pipeline, tokens, lexer-fa, cfg, why-lr, shift-reduce, conflicts, semantic-actions, full-example
- Fix sibling links in theory pages
  - pipeline.md: tokens.md and lexer-fa.md sibling links no longer use theory/ prefix
  - pipeline.md: lexer.md and parser.md reference doc links now use ../ prefix
  - tokens.md: cfg.md sibling link no longer uses theory/ prefix
  - lexer-fa.md: cfg.md sibling link no longer uses theory/ prefix
- …duce section
  - Inserted identical correction blockquote after the reduce/reduce compiler output block
  - Readers who encounter only the RR error message now learn that alwaysBefore/alwaysAfter do not exist in Alpaca API
  - Correct methods are before/after per conflict-resolution.md
- …y pages
  - semantic-actions.md: replace backtick code span with functional [Parser](../parser.md) hyperlink
  - shift-reduce.md: add Next: [Conflicts and Disambiguation](conflicts.md) bullet to Cross-links
  - tokens.md: add Next: [The Lexer: Regex to Finite Automata](lexer-fa.md) bullet to Cross-links
🏃 Runtime Benchmark
Codecov Report

✅ All modified and coverable lines are covered by tests.

```
@@           Coverage Diff           @@
##           master     #309   +/-   ##
=======================================
  Coverage   42.03%   42.03%
=======================================
  Files          35       35
  Lines         433      433
=======================================
  Hits          182      182
  Misses        251      251
```
Pull request overview
Adds a new “Compiler Theory” documentation section that explains Alpaca’s compilation pipeline and LR parsing model, and wires it into the docs sidebar to improve navigation and cross-linking across the parser/lexer docs.
Changes:
- Add “Compiler Theory” nested section to docs/sidebar.yml.
- Introduce a set of new theory pages (pipeline, tokens/lexemes, regex→automata, CFGs, LR motivation, shift-reduce, conflicts, semantic actions, full worked example).
- Add/update cross-links between theory pages and existing reference docs (parser, lexer, conflict resolution, extractors).
Reviewed changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| docs/sidebar.yml | Adds the new “Compiler Theory” section and links to the new pages. |
| docs/_docs/theory/pipeline.md | Introduces the compilation pipeline framing and compile-time vs runtime boundary. |
| docs/_docs/theory/tokens.md | Defines tokens/lexemes and relates them to Alpaca’s lexer output types. |
| docs/_docs/theory/lexer-fa.md | Explains regex→automata and how Alpaca’s lexer implementation relates to that theory. |
| docs/_docs/theory/cfg.md | Introduces CFGs and maps the calculator grammar to Alpaca’s rule DSL. |
| docs/_docs/theory/why-lr.md | Motivates LR parsing vs LL and explains LR family variants. |
| docs/_docs/theory/shift-reduce.md | Walks through the shift/reduce loop with a concrete parse trace. |
| docs/_docs/theory/conflicts.md | Explains shift/reduce and reduce/reduce conflicts and Alpaca’s resolution DSL. |
| docs/_docs/theory/semantic-actions.md | Explains syntax-directed translation and Alpaca’s semantic action model. |
| docs/_docs/theory/full-example.md | Provides an end-to-end calculator example including conflict resolution and evaluation. |
> This means Alpaca's lexer runs with the same O(n) guarantee as a hand-built DFA: one pass
> through the input, no backtracking.

The claim here that the lexer has the same O(n) guarantee as a hand-built DFA and does “no backtracking” isn’t accurate with java.util.regex.Pattern (which is primarily a backtracking engine and can be super-linear for some regexes). Please soften/correct this to avoid promising DFA-like worst-case performance unless Alpaca enforces a safe regex subset.

Suggested change:

> This means Alpaca's lexer processes the input in a single left-to-right pass using a combined
> regex. For typical lexer-style patterns this behaves similarly to a DFA-based lexer, but the
> actual worst-case performance is determined by `java.util.regex.Pattern` and the specific
> regexes you use, which may involve backtracking.
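The single-pass, combined-regex behavior described in this comment can be sketched as a small scanner loop. This is an illustrative assumption, not Alpaca's implementation: `Tok`, `combined`, and `scan` are hypothetical names, and the cost of each match attempt still depends on how `java.util.regex` handles the individual patterns.

```scala
import java.util.regex.Pattern

// Hypothetical token record — not Alpaca's Lexeme type.
final case class Tok(name: String, text: String)

// One named group per token class, joined into a single alternation.
val combined: Pattern = Pattern.compile("(?<NUMBER>[0-9]+)|(?<PLUS>\\+)|(?<WS>\\s+)")

// One left-to-right pass: each step matches at the current offset and advances
// past the match, so no position is rescanned by this loop. Per-step cost is
// whatever java.util.regex needs for the patterns involved (may backtrack).
def scan(input: String): List[Tok] = {
  val m = combined.matcher(input)
  val out = List.newBuilder[Tok]
  var pos = 0
  while (pos < input.length) {
    m.region(pos, input.length)
    if (m.lookingAt()) { // the match must start exactly at `pos`
      if (m.group("NUMBER") != null) out += Tok("NUMBER", m.group("NUMBER"))
      else if (m.group("PLUS") != null) out += Tok("PLUS", m.group("PLUS"))
      // whitespace (WS) is matched but produces no token
      pos = m.end()
    } else {
      throw new IllegalArgumentException(s"no token matches at offset $pos")
    }
  }
  out.result()
}
```

The loop itself makes exactly one forward pass; any super-linear behavior would come from the regex engine inside a single `lookingAt` call, which is the distinction the suggested wording draws.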
> |-----------|-------------------|-------------|-------|
> | LR(0) | None (reduce always) | Smallest | Too weak for most real grammars |
> | SLR(1) | FOLLOW sets (global per non-terminal) | Same as LR(0) | Better, still limited |
> | LALR(1) | Per-state lookahead (merged item-set cores) | Same as LR(0)/SLR | Most common in practice (yacc, Bison, ANTLR) |

ANTLR is listed here as an LALR(1) parser generator, but ANTLR is a top-down LL(*)-style parser generator (not LALR(1)). Please remove ANTLR from this example list or replace it with an actual LALR(1) tool.

Suggested change:

> | LALR(1) | Per-state lookahead (merged item-set cores) | Same as LR(0)/SLR | Most common in practice (yacc, Bison) |
> position in the source. Parsing `"3 + 4"` produces three lexemes:
>
> - `NUMBER("3", pos=0)`
> - `PLUS("+", pos=2)`
> - `NUMBER("4", pos=4)`
>
> The word *lexeme* is used throughout this documentation to mean this complete record.
>
> > **Definition — Lexeme:**
> > A *lexeme* is a triple (T, w, pos) where T is a token class, w ∈ L(T) is the matched string
> > (a member of the language defined by T's regex), and pos is the position of the end of the
> > match in the source text.

These example lexemes use pos as a 0-based absolute offset (pos=0/2/4). Alpaca’s built-in LexerCtx.Default instead tracks position as a 1-based column position (and lexeme fields depend on the chosen context). Please adjust the example field name/values or label it as a conceptual example rather than Alpaca’s concrete fields.

Suggested change:

> position in the source. With Alpaca’s default `LexerCtx.Default`, parsing `"3 + 4"` produces
> three lexemes whose `position` field is the 1-based column at the end of each token:
>
> - `NUMBER("3", position=1)`
> - `PLUS("+", position=3)`
> - `NUMBER("4", position=5)`
>
> The word *lexeme* is used throughout this documentation to mean this complete record.
>
> > **Definition — Lexeme (conceptual):**
> > Conceptually, a *lexeme* is a triple (T, w, position) where T is a token class, w ∈ L(T) is
> > the matched string (a member of the language defined by T's regex), and `position` is the
> > location of the end of the match in the source text, according to whatever coordinate system
> > the lexer context uses (for example, column within the current line).
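The 1-based end-column convention used in the suggested example values can be made concrete with a tiny hand-rolled tokenizer. This is purely illustrative: `Lexeme3` and `tokenize3` are invented names and do not correspond to Alpaca's `Lexeme` or `LexerCtx` API.

```scala
// Conceptual (T, w, position) triple — invented type, not Alpaca's Lexeme.
final case class Lexeme3(tokenClass: String, matched: String, position: Int)

// Tokenize a single-line input, reporting `position` as the 1-based column of
// the END of each match (the convention attributed to LexerCtx.Default above).
def tokenize3(line: String): List[Lexeme3] = {
  val out = List.newBuilder[Lexeme3]
  var i = 0
  while (i < line.length) {
    val c = line(i)
    if (c.isDigit) {
      val start = i
      while (i < line.length && line(i).isDigit) i += 1
      // after the loop, i is the index past the last digit, which equals the
      // 1-based column of the last matched character
      out += Lexeme3("NUMBER", line.substring(start, i), i)
    } else if (c == '+') {
      i += 1
      out += Lexeme3("PLUS", "+", i)
    } else if (c == ' ') {
      i += 1 // whitespace produces no lexeme
    } else {
      throw new IllegalArgumentException(s"unexpected '$c' at column ${i + 1}")
    }
  }
  out.result()
}
```

For `"3 + 4"` this yields end columns 1, 3, and 5 — the same values as in the suggested replacement text.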
> - `name` — the token class name string, e.g., `"NUMBER"` or `"PLUS"`
> - `value` — the extracted value with its Scala type, e.g., `3.14: Double` for NUMBER, `(): Unit`
>   for PLUS
> - `position` — the character offset at the end of the match
> - `line` — the line number at the end of the match

These bullets describe position as a “character offset at the end of the match”. In Alpaca’s default lexer context, position is a 1-based column position within the current line (and lexeme fields depend on the chosen LexerCtx). Please adjust this wording so it matches the actual position semantics and doesn’t imply a fixed, always-present absolute offset.
> The consequence: if your regex is invalid, or your grammar is ambiguous, you get a compile error — not a runtime crash. The pipeline is safe by construction before it ever runs on real input.
>
> Alpaca covers stages 1–3 of the classical pipeline. The "code generation" stage is not part of the library — your Scala semantic actions in the parser rules produce the final typed value directly.

This page defines a 4-stage pipeline and says CalcParser.parse handles stages 3–4, but later says “Alpaca covers stages 1–3 of the classical pipeline.” That’s contradictory given the earlier definition. Please make the stage numbering consistent (either Alpaca covers 1–4 in this tutorial framing, or redefine which stages Alpaca implements).

Suggested change:

> Alpaca covers stages 1–4 of the classical pipeline. The "code generation" stage is not part of the library — your Scala semantic actions in the parser rules produce the final typed value directly.
> > A *lexeme* is a triple (T, w, pos) where T is a token class, w ∈ L(T) is the matched string
> > (a member of the language defined by T's regex), and pos is the position of the end of the
> > match in the source text.
> > In Alpaca: `Lexeme[Name, Value]` where `Name` is the token class name (a string literal type)
> > and `Value` is the Scala type of the extracted value.

The definition describes pos as “the position of the end of the match in the source text.” In Alpaca, position/line come from the lexer context (e.g., PositionTracking/LineTracking) and are column/line oriented rather than a single absolute offset, and custom contexts can change what’s tracked. Please refine this definition so it matches Alpaca’s actual position model (or clearly label it as a language-theory definition separate from Alpaca’s fields).
> A lexer reads a character stream from left to right and emits a token stream. Each scan step
> finds the longest prefix of the remaining input that matches one of the token class patterns —
> this is the *maximal munch* rule. When no pattern matches the current position, the lexer throws
> an error. The result is a flat list of lexemes that the parser consumes next.

This page says the lexer uses the “maximal munch” (longest-prefix) rule across token classes. Alpaca’s implementation uses a single ordered Java-regex alternation and effectively picks the first alternative that matches at the current position, so it is not guaranteed to choose the longest match among all token patterns. Please reword this section to describe the actual ordered “first match wins” behavior (and how to order rules to get the desired result).

Suggested change:

> A lexer reads a character stream from left to right and emits a token stream. At each scan step,
> it tries the token class patterns in their declared order and picks the first pattern whose regex
> matches a prefix of the remaining input — this is an ordered “first match wins” rule. To prefer
> longer or more specific tokens over shorter prefixes, place their rules earlier in the list. When
> no pattern matches the current position, the lexer throws an error. The result is a flat list of
> lexemes that the parser consumes next.
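The ordered “first match wins” behavior of a Java-regex alternation can be demonstrated directly. The property shown is standard `java.util.regex` alternation semantics; `firstMatch` itself is a hypothetical helper, not part of Alpaca.

```scala
import java.util.regex.Pattern

// Java regex alternation tries branches left to right and commits to the first
// branch that matches at the current position — it does not prefer the longest.
def firstMatch(pattern: String, input: String): String = {
  val m = Pattern.compile(pattern).matcher(input)
  if (m.lookingAt()) m.group()
  else throw new IllegalArgumentException(s"no match for $pattern at start of $input")
}
```

Against the input `==`, the pattern `=|==` matches only `=`, while `==|=` matches `==` — which is why a lexer built on such an alternation must declare longer or more specific rules first.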
📊 Test Compilation Benchmark

Result: Current branch is 0.620s slower (1.24%)
Summary
🤖 Generated with Claude Code