
docs: compiler theory section with gap fixes #309

Open
halotukozak wants to merge 14 commits into master from theory-gap-fixes

Conversation

@halotukozak (Owner)

Summary

  • Write theory pages: pipeline, tokens/lexemes, regex-to-automata, CFG, shift-reduce, why-LR, conflicts, semantic actions, full example
  • Add Compiler Theory nested section to sidebar
  • Fix broken cross-links and parser navigation
  • Add alwaysBefore/alwaysAfter correction note

🤖 Generated with Claude Code

- Create docs/_docs/theory/tokens.md (TH-02)
- Terminal symbols definition, token class vs instance distinction
- Formal lexeme definition as triple (T, w, pos) for CROSS-02
- CalcLexer 7-token class table with patterns and value types
- Canonical CalcLexer definition with sc:nocompile
- Tokenization output code example with sc:nocompile
- Cross-links to lexer.md and lexer-fa.md
- New docs/_docs/theory/ directory with pipeline.md as opening theory page
- Explains four compilation stages: source text, lexical analysis, syntactic analysis, semantic analysis
- Documents Alpaca's compile-time vs runtime boundary with standard callout block
- Formal definition block using Unicode math: parse ∘ tokenize : String → R
- Pipeline code example with sc:nocompile referencing CalcLexer and CalcParser
- Type mapping table: pipeline stages to Alpaca types (List[Lexeme], R | Null)
- Cross-links to lexer.md and parser.md
- No CalcParser grammar notation, no LaTeX, all macro blocks marked sc:nocompile
- Create docs/_docs/theory/lexer-fa.md (TH-03)
- Regular language formal definition block for CROSS-02
- NFA/DFA conceptual explanation with state transition table for PLUS token
- DFA 5-tuple formal definition (Q, Σ, δ, q₀, F) for CROSS-02
- Combined alternation pattern explanation grounded in Alpaca internals
- Shadowing detection via dregex subset checking
- Standard compile-time callout block
- Cross-links to lexer.md and tokens.md
…finition

- Top-down vs bottom-up parsing approaches
- Left recursion infinite-loop trace showing LL failure
- LR family comparison table: LR(0), SLR(1), LALR(1), LR(1) with Alpaca marked as LR(1)
- Why LR(1) vs LALR(1) section grounded in Item.scala/ParseTable.scala source
- LR(1) item formal definition using [A → α • β, a] dot notation with examples
- O(n) parsing paragraph
- Compile-time callout in established blockquote format
- Cross-links to cfg.md, shift-reduce.md, ../conflict-resolution.md, ../parser.md
…rmal configuration

- Parse stack explanation: (stateIndex, node) pairs from Parser.scala
- Parse tables section: parse table + action table with separation of concerns
- Simplified 3-production grammar block for trace clarity
- 8-row parse trace table for '1 + 2' with Stack | Remaining input | Action columns
- Annotation notes for steps 1, 2, 6, 7, 8
- Disclaimer that state numbers are illustrative for simplified grammar
- 3 LR(1) item examples with dot notation from Item.scala
- LR parse configuration formal definition in blockquote format
- Connection to Alpaca runtime loop() function prose reference
- O(n) loop termination paragraph
- Compile-time callout in established blockquote format
- Cross-links to why-lr.md, cfg.md, ../conflict-resolution.md, ../parser.md, pipeline.md
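
The shift-reduce loop these bullets describe can be made concrete with a toy implementation. The following is a simplified, language-agnostic sketch in Python, not Alpaca's Scala runtime: instead of consulting a real LR(1) action table, it reduces greedily whenever a production's right-hand side sits on top of the parse stack, which is sufficient for a one-operator grammar.

```python
# Simplified shift-reduce parser for the grammar:
#   E -> E PLUS E | NUM
# A real LR(1) parser consults an action table keyed by (state, lookahead);
# here we reduce greedily whenever a right-hand side is on top of the stack.
def shift_reduce(tokens):
    stack = []          # entries are (symbol, value) pairs
    pos = 0
    while True:
        symbols = [s for s, _ in stack]
        if symbols[-1:] == ["NUM"]:
            # reduce NUM -> E, keeping the number as the synthesized value
            _, v = stack.pop()
            stack.append(("E", v))
        elif symbols[-3:] == ["E", "PLUS", "E"]:
            # reduce E PLUS E -> E, running the semantic action (+)
            _, right = stack.pop()
            stack.pop()
            _, left = stack.pop()
            stack.append(("E", left + right))
        elif pos < len(tokens):
            # shift the next token onto the stack
            stack.append(tokens[pos])
            pos += 1
        else:
            break       # no reduce applies and no input left: done
    assert [s for s, _ in stack] == ["E"], "parse error"
    return stack[0][1]
```

Running `shift_reduce([("NUM", 1), ("PLUS", None), ("NUM", 2)])` performs the same shift/reduce sequence as the 8-row trace for `'1 + 2'` and yields `3`.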
- Formal CFG 4-tuple definition (V, Σ, R, S) in blockquote format
- 7-production CalcParser BNF grammar (6 Expr productions + root)
- Leftmost derivation for 1 + 2 with ⇒ steps
- ASCII parse tree for 1 + 2
- CalcParser Alpaca DSL block annotated with sc:nocompile
- Compile-time callout in established blockquote format
- Cross-links to tokens.md, why-lr.md, ../parser.md, ../conflict-resolution.md
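
The leftmost-derivation idea above can be mechanized: repeatedly rewrite the leftmost nonterminal using a chosen production. A minimal Python sketch with an illustrative two-production grammar (not the full 7-production CalcParser):

```python
# Illustrative grammar: Expr -> Expr PLUS Expr | NUMBER
GRAMMAR = {
    "Expr": [["Expr", "PLUS", "Expr"], ["NUMBER"]],
}

def derive(start, choices):
    """Apply productions to the leftmost nonterminal; `choices` gives the
    production index to use at each step. Returns every sentential form,
    i.e. the full derivation sequence."""
    forms = [[start]]
    for choice in choices:
        form = forms[-1][:]
        # find the leftmost nonterminal and expand it
        i = next(i for i, sym in enumerate(form) if sym in GRAMMAR)
        form[i:i + 1] = GRAMMAR[form[i]][choice]
        forms.append(form)
    return [" ".join(f) for f in forms]
```

`derive("Expr", [0, 1, 1])` reproduces the ⇒-step style of derivation: `Expr ⇒ Expr PLUS Expr ⇒ NUMBER PLUS Expr ⇒ NUMBER PLUS NUMBER`.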
- Formal definition block for Parse Table Conflict (state/symbol pair collision)
- Shift/reduce conflict section with CalcParser 1+2+3 example and real Alpaca error message
- alwaysBefore/alwaysAfter discrepancy note immediately after error block
- Reduce/reduce conflict section with Integer/Float example and error message
- LR(1) lookahead disambiguation section
- Resolution by priority section with minimal sc:nocompile example (production.plus only)
- Compile-time detection section with standard blockquote callout
- Cross-links to cfg.md, shift-reduce.md, ../conflict-resolution.md, semantic-actions.md, full-example.md
- Six-step narrative from bare grammar to working calculator (7.0)
- CalcLexer definition, bare CalcParser with ShiftReduceConflict error
- Resolved CalcParser with all 6 resolutions using production.div (not production.divide)
- Pipeline evaluation: 1+2*3=7.0, (1+2)*3=9.0 with null-check note
- Semantic action trace for 1+2*3 showing 2*3 reduces before 1+...
- Formal definition block, compile-time callout blockquote
- Theory-to-code mapping table with cross-links to all theory pages
- Formal definition block for Semantic Action (S-attributed scheme)
- Syntax-directed translation section with synthesized attribute explanation
- Extractor pattern section with complete 7-production CalcParser action table
- No-parse-tree section grounded in Parser.scala loop() implementation
- Typed results section explaining Rule[Double] compile-time type checking
- Compile-time processing callout
- Cross-links to shift-reduce.md, conflicts.md, ../extractors.md, ../parser.md, full-example.md
- No Rule[Int], no n.value.toDouble, no inherited attribute, no L-attributed
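
The S-attributed scheme mentioned above — each production computes its left-hand side's value purely from its children's values — can be sketched outside Alpaca. Production names (`num`, `plus`, `times`) and the tree shape here are illustrative assumptions, not Alpaca's extractor API:

```python
# S-attributed semantic actions: one function per production, each
# computing a synthesized attribute from the right-hand side's values.
actions = {
    "num":   lambda n: float(n),
    "plus":  lambda a, b: a + b,
    "times": lambda a, b: a * b,
}

def evaluate(node):
    """node = (production, *children); children are nodes or token text."""
    prod, *kids = node
    # evaluate children first: bottom-up, like reductions in an LR parse
    vals = [evaluate(k) if isinstance(k, tuple) else k for k in kids]
    return actions[prod](*vals)

# parse tree for "1 + 2 * 3", with * binding tighter than +
tree = ("plus", ("num", "1"), ("times", ("num", "2"), ("num", "3")))
```

`evaluate(tree)` reduces `2 * 3` before `1 + ...`, mirroring the semantic action trace, and yields `7.0`.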
- Add 'Compiler Theory' subsection with 9 theory pages in pipeline order
- Pages use theory/pagename.md format resolving to docs/_docs/theory/
- Order: pipeline, tokens, lexer-fa, cfg, why-lr, shift-reduce, conflicts, semantic-actions, full-example
- pipeline.md: tokens.md and lexer-fa.md sibling links no longer use theory/ prefix
- pipeline.md: lexer.md and parser.md reference doc links now use ../ prefix
- tokens.md: cfg.md sibling link no longer uses theory/ prefix
- lexer-fa.md: cfg.md sibling link no longer uses theory/ prefix
…duce section

- Inserted identical correction blockquote after the reduce/reduce compiler output block
- Readers who encounter only the RR error message now learn that alwaysBefore/alwaysAfter do not exist in Alpaca API
- Correct methods are before/after per conflict-resolution.md
…y pages

- semantic-actions.md: replace backtick code span with functional [Parser](../parser.md) hyperlink
- shift-reduce.md: add Next: [Conflicts and Disambiguation](conflicts.md) bullet to Cross-links
- tokens.md: add Next: [The Lexer: Regex to Finite Automata](lexer-fa.md) bullet to Cross-links
Copilot AI review requested due to automatic review settings March 4, 2026 14:50
@github-actions github-actions bot added the documentation Improvements or additions to documentation label Mar 4, 2026
@github-actions

github-actions bot commented Mar 4, 2026

🏃 Runtime Benchmark

Benchmark Base (master) Current (theory-gap-fixes) Diff

@codecov

codecov bot commented Mar 4, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.

@@           Coverage Diff           @@
##           master     #309   +/-   ##
=======================================
  Coverage   42.03%   42.03%           
=======================================
  Files          35       35           
  Lines         433      433           
=======================================
  Hits          182      182           
  Misses        251      251           

Copilot AI (Contributor) left a comment


Pull request overview

Adds a new “Compiler Theory” documentation section that explains Alpaca’s compilation pipeline and LR parsing model, and wires it into the docs sidebar to improve navigation and cross-linking across the parser/lexer docs.

Changes:

  • Add “Compiler Theory” nested section to docs/sidebar.yml.
  • Introduce a set of new theory pages (pipeline, tokens/lexemes, regex→automata, CFGs, LR motivation, shift-reduce, conflicts, semantic actions, full worked example).
  • Add/update cross-links between theory pages and existing reference docs (parser, lexer, conflict resolution, extractors).

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 7 comments.

Summary per file:

| File | Description |
|------|-------------|
| docs/sidebar.yml | Adds the new "Compiler Theory" section and links to the new pages. |
| docs/_docs/theory/pipeline.md | Introduces the compilation pipeline framing and compile-time vs runtime boundary. |
| docs/_docs/theory/tokens.md | Defines tokens/lexemes and relates them to Alpaca's lexer output types. |
| docs/_docs/theory/lexer-fa.md | Explains regex→automata and how Alpaca's lexer implementation relates to that theory. |
| docs/_docs/theory/cfg.md | Introduces CFGs and maps the calculator grammar to Alpaca's rule DSL. |
| docs/_docs/theory/why-lr.md | Motivates LR parsing vs LL and explains LR family variants. |
| docs/_docs/theory/shift-reduce.md | Walks through the shift/reduce loop with a concrete parse trace. |
| docs/_docs/theory/conflicts.md | Explains shift/reduce and reduce/reduce conflicts and Alpaca's resolution DSL. |
| docs/_docs/theory/semantic-actions.md | Explains syntax-directed translation and Alpaca's semantic action model. |
| docs/_docs/theory/full-example.md | Provides an end-to-end calculator example including conflict resolution and evaluation. |


Comment on lines +82 to +83
This means Alpaca's lexer runs with the same O(n) guarantee as a hand-built DFA: one pass
through the input, no backtracking.

Copilot AI Mar 4, 2026


The claim here that the lexer has the same O(n) guarantee as a hand-built DFA and does “no backtracking” isn’t accurate with java.util.regex.Pattern (which is primarily a backtracking engine and can be super-linear for some regexes). Please soften/correct this to avoid promising DFA-like worst-case performance unless Alpaca enforces a safe regex subset.

Suggested change
This means Alpaca's lexer runs with the same O(n) guarantee as a hand-built DFA: one pass
through the input, no backtracking.
This means Alpaca's lexer processes the input in a single left-to-right pass using a combined
regex. For typical lexer-style patterns this behaves similarly to a DFA-based lexer, but the
actual worst-case performance is determined by `java.util.regex.Pattern` and the specific
regexes you use, which may involve backtracking.

Copilot uses AI. Check for mistakes.
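
To see concretely what the "DFA guarantee" being discussed means, here is an illustrative table-driven DFA in Python (a sketch, not Alpaca's lexer; the NUMBER-style pattern `[0-9]+(\.[0-9]+)?` is assumed for illustration). Each input character triggers exactly one state transition, so recognition is a single O(n) pass with no backtracking — precisely the property a backtracking regex engine does not guarantee in the worst case:

```python
# DFA for [0-9]+(\.[0-9]+)? — states: 0 start, 1 integer part,
# 2 just after the dot, 3 fractional part.
ACCEPTING = {1, 3}

def step(state, ch):
    """The transition function δ(q, a); None is the dead state."""
    if state == 0 and ch.isdigit(): return 1
    if state == 1 and ch.isdigit(): return 1
    if state == 1 and ch == ".":    return 2
    if state == 2 and ch.isdigit(): return 3
    if state == 3 and ch.isdigit(): return 3
    return None

def matches(text):
    state = 0
    for ch in text:            # exactly one pass over the input
        state = step(state, ch)
        if state is None:
            return False       # dead state: reject immediately
    return state in ACCEPTING
```

Note how `"3."` is rejected: after the dot the DFA is in state 2, which is not accepting, and no re-scanning of earlier input ever occurs.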
|-----------|-------------------|-------------|-------|
| LR(0) | None (reduce always) | Smallest | Too weak for most real grammars |
| SLR(1) | FOLLOW sets (global per non-terminal) | Same as LR(0) | Better, still limited |
| LALR(1) | Per-state lookahead (merged item-set cores) | Same as LR(0)/SLR | Most common in practice (yacc, Bison, ANTLR) |

Copilot AI Mar 4, 2026


ANTLR is listed here as an LALR(1) parser generator, but ANTLR is a top-down LL(*)-style parser generator (not LALR(1)). Please remove ANTLR from this example list or replace it with an actual LALR(1) tool.

Suggested change
| LALR(1) | Per-state lookahead (merged item-set cores) | Same as LR(0)/SLR | Most common in practice (yacc, Bison, ANTLR) |
| LALR(1) | Per-state lookahead (merged item-set cores) | Same as LR(0)/SLR | Most common in practice (yacc, Bison) |

Comment on lines +29 to +40
position in the source. Parsing `"3 + 4"` produces three lexemes:

- `NUMBER("3", pos=0)`
- `PLUS("+", pos=2)`
- `NUMBER("4", pos=4)`

The word *lexeme* is used throughout this documentation to mean this complete record.

> **Definition — Lexeme:**
> A *lexeme* is a triple (T, w, pos) where T is a token class, w ∈ L(T) is the matched string
> (a member of the language defined by T's regex), and pos is the position of the end of the
> match in the source text.

Copilot AI Mar 4, 2026


These example lexemes use pos as a 0-based absolute offset (pos=0/2/4). Alpaca’s built-in LexerCtx.Default instead tracks position as a 1-based column position (and lexeme fields depend on the chosen context). Please adjust the example field name/values or label it as a conceptual example rather than Alpaca’s concrete fields.

Suggested change
position in the source. Parsing `"3 + 4"` produces three lexemes:
- `NUMBER("3", pos=0)`
- `PLUS("+", pos=2)`
- `NUMBER("4", pos=4)`
The word *lexeme* is used throughout this documentation to mean this complete record.
> **Definition — Lexeme:**
> A *lexeme* is a triple (T, w, pos) where T is a token class, w ∈ L(T) is the matched string
> (a member of the language defined by T's regex), and pos is the position of the end of the
> match in the source text.
position in the source. With Alpaca’s default `LexerCtx.Default`, parsing `"3 + 4"` produces
three lexemes whose `position` field is the 1-based column at the end of each token:
- `NUMBER("3", position=1)`
- `PLUS("+", position=3)`
- `NUMBER("4", position=5)`
The word *lexeme* is used throughout this documentation to mean this complete record.
> **Definition — Lexeme (conceptual):**
> Conceptually, a *lexeme* is a triple (T, w, position) where T is a token class, w ∈ L(T) is
> the matched string (a member of the language defined by T's regex), and `position` is the
> location of the end of the match in the source text, according to whatever coordinate system
> the lexer context uses (for example, column within the current line).

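
The language-theory view of a lexeme as a (T, w, pos) triple can be demonstrated with a small scanner. This Python sketch uses 0-based end offsets purely for illustration; as the review comment notes, Alpaca's default context reports positions differently, so none of this reflects Alpaca's concrete fields:

```python
import re

# Conceptual lexemes as (token_class, matched_text, end_pos) triples.
# end_pos is the 0-based offset just past the match — a language-theory
# convention, not Alpaca's position model.
TOKEN_CLASSES = [
    ("NUMBER", re.compile(r"[0-9]+")),
    ("PLUS",   re.compile(r"\+")),
    ("WS",     re.compile(r"\s+")),
]

def tokenize(src):
    lexemes, pos = [], 0
    while pos < len(src):
        for name, pattern in TOKEN_CLASSES:
            m = pattern.match(src, pos)
            if m:
                if name != "WS":   # whitespace is skipped, not emitted
                    lexemes.append((name, m.group(), m.end()))
                pos = m.end()
                break
        else:
            raise ValueError(f"no token matches at position {pos}")
    return lexemes
```

For `"3 + 4"` this yields `NUMBER("3", 1)`, `PLUS("+", 3)`, `NUMBER("4", 5)` — the same three lexemes as the doc's example, just in this sketch's coordinate system.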
Comment on lines +49 to +53
- `name` — the token class name string, e.g., `"NUMBER"` or `"PLUS"`
- `value` — the extracted value with its Scala type, e.g., `3.14: Double` for NUMBER, `(): Unit`
for PLUS
- `position` — the character offset at the end of the match
- `line` — the line number at the end of the match

Copilot AI Mar 4, 2026


These bullets describe position as a “character offset at the end of the match”. In Alpaca’s default lexer context, position is a 1-based column position within the current line (and lexeme fields depend on the chosen LexerCtx). Please adjust this wording so it matches the actual position semantics and doesn’t imply a fixed, always-present absolute offset.


The consequence: if your regex is invalid, or your grammar is ambiguous, you get a compile error — not a runtime crash. The pipeline is safe by construction before it ever runs on real input.

Alpaca covers stages 1–3 of the classical pipeline. The "code generation" stage is not part of the library — your Scala semantic actions in the parser rules produce the final typed value directly.

Copilot AI Mar 4, 2026


This page defines a 4-stage pipeline and says CalcParser.parse handles stages 3–4, but later says “Alpaca covers stages 1–3 of the classical pipeline.” That’s contradictory given the earlier definition. Please make the stage numbering consistent (either Alpaca covers 1–4 in this tutorial framing, or redefine which stages Alpaca implements).

Suggested change
Alpaca covers stages 1–3 of the classical pipeline. The "code generation" stage is not part of the library — your Scala semantic actions in the parser rules produce the final typed value directly.
Alpaca covers stages 1–4 of the classical pipeline. The "code generation" stage is not part of the library — your Scala semantic actions in the parser rules produce the final typed value directly.

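
The pipeline composition under discussion — parse ∘ tokenize : String → R — can be sketched end to end. Function names and the tiny NUMBER/PLUS language below are illustrative assumptions, not Alpaca's API; a failed stage yields `None`, the sketch's stand-in for a null result:

```python
import re

# Stage 2: lexical analysis. One combined pattern; a position where
# nothing matches is a lexical error.
TOKEN = re.compile(r"\s*(?:(?P<NUMBER>[0-9]+)|(?P<PLUS>\+))")

def tokenize(src):
    tokens, pos = [], 0
    while pos < len(src):
        m = TOKEN.match(src, pos)
        if m is None:
            return None            # lexical error
        tokens.append((m.lastgroup, m.group(m.lastgroup)))
        pos = m.end()
    return tokens

# Stages 3-4: syntactic analysis plus semantic actions, fused.
def parse(tokens):
    # grammar: Expr -> NUMBER (PLUS NUMBER)* ; synthesized value = sum
    if not tokens or tokens[0][0] != "NUMBER":
        return None
    total, i = float(tokens[0][1]), 1
    while i < len(tokens):
        if i + 1 < len(tokens) and tokens[i][0] == "PLUS" and tokens[i + 1][0] == "NUMBER":
            total += float(tokens[i + 1][1])
            i += 2
        else:
            return None            # syntax error
    return total

def run_pipeline(src):
    tokens = tokenize(src)         # parse . tokenize : String -> value | None
    return None if tokens is None else parse(tokens)
```

`run_pipeline("1 + 2")` returns `3.0`, while a lexical error (`"$"`) or syntax error (`"1 +"`) surfaces as `None` from the whole composition.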
Comment on lines +38 to +42
> A *lexeme* is a triple (T, w, pos) where T is a token class, w ∈ L(T) is the matched string
> (a member of the language defined by T's regex), and pos is the position of the end of the
> match in the source text.
> In Alpaca: `Lexeme[Name, Value]` where `Name` is the token class name (a string literal type)
> and `Value` is the Scala type of the extracted value.

Copilot AI Mar 4, 2026


The definition describes pos as “the position of the end of the match in the source text.” In Alpaca, position/line come from the lexer context (e.g., PositionTracking/LineTracking) and are column/line oriented rather than a single absolute offset, and custom contexts can change what’s tracked. Please refine this definition so it matches Alpaca’s actual position model (or clearly label it as a language-theory definition separate from Alpaca’s fields).

Comment on lines +5 to +8
A lexer reads a character stream from left to right and emits a token stream. Each scan step
finds the longest prefix of the remaining input that matches one of the token class patterns —
this is the *maximal munch* rule. When no pattern matches the current position, the lexer throws
an error. The result is a flat list of lexemes that the parser consumes next.

Copilot AI Mar 4, 2026


This page says the lexer uses the “maximal munch” (longest-prefix) rule across token classes. Alpaca’s implementation uses a single ordered Java-regex alternation and effectively picks the first alternative that matches at the current position, so it is not guaranteed to choose the longest match among all token patterns. Please reword this section to describe the actual ordered “first match wins” behavior (and how to order rules to get the desired result).

Suggested change
A lexer reads a character stream from left to right and emits a token stream. Each scan step
finds the longest prefix of the remaining input that matches one of the token class patterns —
this is the *maximal munch* rule. When no pattern matches the current position, the lexer throws
an error. The result is a flat list of lexemes that the parser consumes next.
A lexer reads a character stream from left to right and emits a token stream. At each scan step,
it tries the token class patterns in their declared order and picks the first pattern whose regex
matches a prefix of the remaining input — this is an ordered “first match wins” rule. To prefer
longer or more specific tokens over shorter prefixes, place their rules earlier in the list. When
no pattern matches the current position, the lexer throws an error. The result is a flat list of
lexemes that the parser consumes next.

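
The difference between maximal munch and the ordered "first match wins" rule described in this comment is easy to demonstrate. A minimal Python sketch (illustrative token classes, not Alpaca's): with `=` declared before `==`, the input `==` is split into two ASSIGN tokens because the shorter pattern wins by order, not by length.

```python
import re

# Ordered "first match wins" scanning: try patterns in declaration
# order and take the first that matches — NOT the longest match.
def scan(src, token_classes):
    out, pos = [], 0
    while pos < len(src):
        for name, pattern in token_classes:
            m = re.compile(pattern).match(src, pos)
            if m and m.group():
                out.append(name)
                pos = m.end()
                break
        else:
            raise ValueError(f"no match at {pos}")
    return out

bad  = [("ASSIGN", r"="), ("EQ", r"==")]   # '=' shadows '=='
good = [("EQ", r"=="), ("ASSIGN", r"=")]   # longer token declared first
```

With the `bad` ordering, `scan("==", bad)` yields `["ASSIGN", "ASSIGN"]`; with `good`, it yields `["EQ"]` — which is why longer or more specific rules should be declared first under this scheme.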
@github-actions

github-actions bot commented Mar 4, 2026

📊 Test Compilation Benchmark

| Branch | Average Time |
|--------|--------------|
| Base (master) | 49.990s |
| Current (theory-gap-fixes) | 50.610s |

Result: Current branch is 0.620s slower (1.24%) ⚠️


Labels

documentation Improvements or additions to documentation
