Skip to content

Theory application#288

Open
halotukozak wants to merge 6 commits intomasterfrom
theory-application
Open

Theory application#288
halotukozak wants to merge 6 commits intomasterfrom
theory-application

Conversation

@halotukozak
Copy link
Copy Markdown
Owner

@halotukozak halotukozak commented Mar 3, 2026

Summary

Add compiler theory foundation pages to the documentation, covering formal CS concepts that underpin Alpaca's design.

Pages added

  • theory/pipeline.md — The Compilation Pipeline: source → tokens → parse tree → typed result, Alpaca's compile-time macro vs runtime split
  • theory/tokens.md — Tokens & Lexemes: formal token definition, lexeme vs token distinction, token stream examples
  • theory/finite-automata.md — Regex to Finite Automata: how lexer patterns compile to DFA/NFA, Thompson's construction overview
  • theory/cfg.md — Context-Free Grammars: formal 4-tuple G = (V, Σ, R, S), BNF notation, derivation examples
  • theory/why-lr.md — Why LR Parsing: LR vs LL comparison, LR(1) family overview, formal LR item definition
  • theory/shift-reduce.md — Shift-Reduce Parsing: LR parse stack mechanics, 8-step trace walkthrough, formal configuration notation

Also includes

  • Compile-time mental model callouts added to 5 existing reference pages
  • Error catalog gaps closed (InconsistentConflictResolution, AlpacaTimeoutException)
  • CalcParser running example fix to use correct token names

🤖 Generated with Claude Code

- Create docs/_docs/theory/tokens.md (TH-02)
- Terminal symbols definition, token class vs instance distinction
- Formal lexeme definition as triple (T, w, pos) for CROSS-02
- CalcLexer 7-token class table with patterns and value types
- Canonical CalcLexer definition with sc:nocompile
- Tokenization output code example with sc:nocompile
- Cross-links to lexer.md and lexer-fa.md
- New docs/_docs/theory/ directory with pipeline.md as opening theory page
- Explains four compilation stages: source text, lexical analysis, syntactic analysis, semantic analysis
- Documents Alpaca's compile-time vs runtime boundary with standard callout block
- Formal definition block using Unicode math: parse ∘ tokenize : String → R
- Pipeline code example with sc:nocompile referencing CalcLexer and CalcParser
- Type mapping table: pipeline stages to Alpaca types (List[Lexeme], R | Null)
- Cross-links to lexer.md and parser.md
- No CalcParser grammar notation, no LaTeX, all macro blocks marked sc:nocompile
- Create docs/_docs/theory/lexer-fa.md (TH-03)
- Regular language formal definition block for CROSS-02
- NFA/DFA conceptual explanation with state transition table for PLUS token
- DFA 5-tuple formal definition (Q, Σ, δ, q₀, F) for CROSS-02
- Combined alternation pattern explanation grounded in Alpaca internals
- Shadowing detection via dregex subset checking
- Standard compile-time callout block
- Cross-links to lexer.md and tokens.md
…finition

- Top-down vs bottom-up parsing approaches
- Left recursion infinite-loop trace showing LL failure
- LR family comparison table: LR(0), SLR(1), LALR(1), LR(1) with Alpaca marked as LR(1)
- Why LR(1) vs LALR(1) section grounded in Item.scala/ParseTable.scala source
- LR(1) item formal definition using [A → α • β, a] dot notation with examples
- O(n) parsing paragraph
- Compile-time callout in established blockquote format
- Cross-links to cfg.md, shift-reduce.md, ../conflict-resolution.md, ../parser.md
…rmal configuration

- Parse stack explanation: (stateIndex, node) pairs from Parser.scala
- Parse tables section: parse table + action table with separation of concerns
- Simplified 3-production grammar block for trace clarity
- 8-row parse trace table for '1 + 2' with Stack | Remaining input | Action columns
- Annotation notes for steps 1, 2, 6, 7, 8
- Disclaimer that state numbers are illustrative for simplified grammar
- 3 LR(1) item examples with dot notation from Item.scala
- LR parse configuration formal definition in blockquote format
- Connection to Alpaca runtime loop() function prose reference
- O(n) loop termination paragraph
- Compile-time callout in established blockquote format
- Cross-links to why-lr.md, cfg.md, ../conflict-resolution.md, ../parser.md, pipeline.md
- Formal CFG 4-tuple definition (V, Σ, R, S) in blockquote format
- 7-production CalcParser BNF grammar (6 Expr productions + root)
- Leftmost derivation for 1 + 2 with ⇒ steps
- ASCII parse tree for 1 + 2
- CalcParser Alpaca DSL block annotated with sc:nocompile
- Compile-time callout in established blockquote format
- Cross-links to tokens.md, why-lr.md, ../parser.md, ../conflict-resolution.md
@halotukozak halotukozak self-assigned this Mar 3, 2026
Copilot AI review requested due to automatic review settings March 3, 2026 19:54
@github-actions
Copy link
Copy Markdown

github-actions bot commented Mar 3, 2026

File Coverage
All files 32%
alpaca/lexer.scala 0%
alpaca/parser.scala 0%
alpaca/lexer.scala 93%
alpaca/internal/Showable.scala 67%
alpaca/internal/NEL.scala 44%
alpaca/internal/DebugPosition.scala 0%
alpaca/internal/Csv.scala 0%
alpaca/internal/errors.scala 0%
alpaca/internal/logger.scala 0%
alpaca/internal/internal.scala 87%
alpaca/internal/Showable.scala 70%
alpaca/internal/logger.scala 17%
alpaca/internal/AlpacaException.scala 0%
alpaca/internal/logger.scala 0%
alpaca/internal/Default.scala 0%
alpaca/internal/lexer/LazyReader.scala 95%
alpaca/internal/lexer/Tokenization.scala 94%
alpaca/internal/lexer/Lexeme.scala 83%
alpaca/internal/parser/ParseAction.scala 62%
alpaca/internal/parser/Symbol.scala 0%
alpaca/internal/parser/Production.scala 0%
alpaca/internal/parser/Item.scala 0%
alpaca/internal/parser/ConflictException.scala 0%
alpaca/internal/parser/ConflictException.scala 0%
alpaca/internal/parser/State.scala 0%
alpaca/internal/parser/ConflictException.scala 0%
alpaca/internal/parser/ParseTable.scala 4%
alpaca/internal/parser/ParserExtractors.scala 0%
alpaca/internal/parser/ConflictResolution.scala 0%
alpaca/internal/parser/FirstSet.scala 76%
alpaca/internal/parser/Symbol.scala 3%
alpaca/internal/parser/ParserExtractors.scala 50%
alpaca/internal/parser/ConflictResolution.scala 0%
alpaca/internal/parser/ParseAction.scala 0%
alpaca/internal/parser/ConflictResolution.scala 0%

Minimum allowed coverage is 0%

Generated by 🐒 cobertura-action against 6ac2409

@github-actions
Copy link
Copy Markdown

github-actions bot commented Mar 3, 2026

🏃 Runtime Benchmark

Benchmark Base (master) Current (theory-application) Diff
parseComplexJson 0.181 ms/op 0.181 ms/op -0.000 (-0.2%) ℹ️
parseSimpleJson 0.118 ms/op 0.124 ms/op +0.005 (+4.4%) ℹ️
tokenizeComplexJson 0.030 ms/op 0.030 ms/op -0.000 (-0.2%) ℹ️
tokenizeSimpleJson 0.007 ms/op 0.007 ms/op -0.000 (-2.0%) ℹ️

@codecov
Copy link
Copy Markdown

codecov bot commented Mar 3, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.

@@            Coverage Diff            @@
##             master     #288   +/-   ##
=========================================
  Coverage          ?   37.37%           
=========================================
  Files             ?       34           
  Lines             ?      404           
  Branches          ?        0           
=========================================
  Hits              ?      151           
  Misses            ?      253           
  Partials          ?        0           
🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds a set of “Compiler Theory Tutorial” documentation pages explaining Alpaca’s compilation pipeline, lexing concepts, CFGs, and LR parsing (including shift/reduce mechanics and LR(1) vs LALR(1)).

Changes:

  • Introduces new theory docs for tokens/lexemes, the pipeline, CFG basics, and lexer → automata concepts.
  • Adds LR-focused pages explaining why LR is used and a step-by-step shift/reduce trace.
  • Adds cross-links between the new pages and existing reference docs.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 10 comments.

Show a summary per file
File Description
docs/_docs/theory/why-lr.md Explains LL vs LR, left recursion, and why Alpaca uses full LR(1).
docs/_docs/theory/tokens.md Defines tokens, token classes vs instances, and lexemes with CalcLexer examples.
docs/_docs/theory/shift-reduce.md Walkthrough of the shift/reduce loop with a concrete parse trace.
docs/_docs/theory/pipeline.md Describes the Alpaca compilation pipeline and compile-time vs runtime boundary.
docs/_docs/theory/lexer-fa.md Explains regex → FA theory and relates it to Alpaca’s lexer implementation.
docs/_docs/theory/cfg.md Introduces CFGs and maps the calculator grammar to Alpaca’s parser DSL.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +28 to +54
**Lexeme** — the full record of a token instance: the token class, the matched text, and its
position in the source. Parsing `"3 + 4"` produces three lexemes:

- `NUMBER("3", pos=0)`
- `PLUS("+", pos=2)`
- `NUMBER("4", pos=4)`

The word *lexeme* is used throughout this documentation to mean this complete record.

> **Definition — Lexeme:**
> A *lexeme* is a triple (T, w, pos) where T is a token class, w ∈ L(T) is the matched string
> (a member of the language defined by T's regex), and pos is the position of the end of the
> match in the source text.
> In Alpaca: `Lexeme[Name, Value]` where `Name` is the token class name (a string literal type)
> and `Value` is the Scala type of the extracted value.

## Alpaca's Lexeme Type

In Alpaca, each matched token is represented as a `Lexeme[Name, Value]`. A lexeme carries four
pieces of information:

- `name` — the token class name string, e.g., `"NUMBER"` or `"PLUS"`
- `value` — the extracted value with its Scala type, e.g., `3.14: Double` for NUMBER, `(): Unit`
for PLUS
- `position` — the character offset at the end of the match
- `line` — the line number at the end of the match

Copy link

Copilot AI Mar 3, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This section describes pos/position as an absolute character offset in the whole source and implies every lexeme always has position/line. In the implementation, LexerCtx.Default.position is a 1-based column within the current line (reset on newline), and lexeme fields depend on the chosen lexer context (e.g., LexerCtx.Empty has no line/position). Please adjust the definition and examples to match the actual semantics (line + column-at-end-of-match, when the context tracks them).

Copilot uses AI. Check for mistakes.
Comment on lines +73 to +76
| Syntactic analysis | `List[Lexeme]` | parse tree (internal) | LR(1) stack (internal) |
| Semantic analysis | parse tree | typed result | `R \| Null` (your root type) |

The parse tree is never exposed directly — Alpaca builds it internally and immediately evaluates your semantic actions (the `=>` expressions in `rule` definitions). What you get back from `parse` is the final typed value, not an intermediate tree.
Copy link

Copilot AI Mar 3, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The stage mapping and following paragraph continue to describe the syntactic stage as producing a "parse tree (internal)". This conflicts with the actual parser runtime, which only maintains an LR stack and computes results on reductions without constructing a tree object. Consider describing this as a conceptual parse structure or as "LR(1) stack + reductions" rather than an internal parse tree.

Suggested change
| Syntactic analysis | `List[Lexeme]` | parse tree (internal) | LR(1) stack (internal) |
| Semantic analysis | parse tree | typed result | `R \| Null` (your root type) |
The parse tree is never exposed directly — Alpaca builds it internally and immediately evaluates your semantic actions (the `=>` expressions in `rule` definitions). What you get back from `parse` is the final typed value, not an intermediate tree.
| Syntactic analysis | `List[Lexeme]` | conceptual parse structure (via LR(1) stack + reductions) | LR(1) stack + reductions (internal) |
| Semantic analysis | conceptual parse structure | typed result | `R \| Null` (your root type) |
Alpaca never constructs or returns an explicit parse tree object. Instead, it uses an LR(1) stack and applies your semantic actions (the `=>` expressions in `rule` definitions) on each reduction, so what you get back from `parse` is the final typed value, not an intermediate tree.

Copilot uses AI. Check for mistakes.
Comment on lines +5 to +8
A lexer reads a character stream from left to right and emits a token stream. Each scan step
finds the longest prefix of the remaining input that matches one of the token class patterns —
this is the *maximal munch* rule. When no pattern matches the current position, the lexer throws
an error. The result is a flat list of lexemes that the parser consumes next.
Copy link

Copilot AI Mar 3, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The lexer implementation does not enforce a global "maximal munch" across token patterns; it matches the first alternative that succeeds in the combined ordered regex (rule-order priority), not necessarily the longest among all token classes. Please update this description to match the actual "patterns are tried in order; first match wins" behavior.

Suggested change
A lexer reads a character stream from left to right and emits a token stream. Each scan step
finds the longest prefix of the remaining input that matches one of the token class patterns —
this is the *maximal munch* rule. When no pattern matches the current position, the lexer throws
an error. The result is a flat list of lexemes that the parser consumes next.
A lexer reads a character stream from left to right and emits a token stream. At each position,
it tries the token class patterns in their specified order and picks the first one whose regex
matches a prefix of the remaining input — patterns are tried in order; first match wins. When no
pattern matches the current position, the lexer throws an error. The result is a flat list of
lexemes that the parser consumes next.

Copilot uses AI. Check for mistakes.

- See [Lexer](../lexer.md) for the complete `lexer` DSL reference.
- See [Tokens and Lexemes](tokens.md) for what the lexer produces — the lexeme stream.
- Next: [Context-Free Grammars](theory/cfg.md) for how token streams are parsed.
Copy link

Copilot AI Mar 3, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The cross-link theory/cfg.md is incorrect from within docs/_docs/theory/lexer-fa.md (it resolves to .../theory/theory/cfg.md). It should link to cfg.md (same directory).

Suggested change
- Next: [Context-Free Grammars](theory/cfg.md) for how token streams are parsed.
- Next: [Context-Free Grammars](cfg.md) for how token streams are parsed.

Copilot uses AI. Check for mistakes.
(1.0) (2.0)
```

Note: In Alpaca, the parse tree is never exposed to user code. The `Parser` macro builds it internally during the shift-reduce parse, and immediately evaluates your semantic actions (the `=>` expressions in `rule` definitions) as each node is reduced. What `parse()` returns is the typed result — a `Double` in the calculator case — not an intermediate tree object. (See [The Compilation Pipeline](pipeline.md) for the full picture.)
Copy link

Copilot AI Mar 3, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This note says the Parser macro "builds" a parse tree internally. The runtime implementation performs shift/reduce and immediately evaluates semantic actions; it does not construct or retain a parse tree object (see src/alpaca/internal/parser/Parser.scala). Please rephrase to avoid contradicting the shift/reduce page (which states no parse tree is constructed).

Suggested change
Note: In Alpaca, the parse tree is never exposed to user code. The `Parser` macro builds it internally during the shift-reduce parse, and immediately evaluates your semantic actions (the `=>` expressions in `rule` definitions) as each node is reduced. What `parse()` returns is the typed result — a `Double` in the calculator case — not an intermediate tree object. (See [The Compilation Pipeline](pipeline.md) for the full picture.)
Note: In Alpaca, parse trees are a conceptual model only; the runtime LR parser does not construct or retain an explicit parse-tree object. During shift–reduce parsing it immediately evaluates your semantic actions (the `=>` expressions in `rule` definitions) as each production is reduced, and the `Parser` macro’s job is to analyze those rules at compile time and generate the LR parse tables. What `parse()` returns is the typed result — a `Double` in the calculator case — not an intermediate tree structure. (See [The Compilation Pipeline](pipeline.md) for the full picture.)

Copilot uses AI. Check for mistakes.
that appear in source text. In a lexer, each token class acts as a terminal: it names a category
of strings, and no lexer-level expansion applies below it.

See [Context-Free Grammars](theory/cfg.md) for how terminals fit into production rules.
Copy link

Copilot AI Mar 3, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The relative link theory/cfg.md is incorrect from within docs/_docs/theory/tokens.md (it resolves to .../theory/theory/cfg.md). It should link to cfg.md (same directory) to avoid a broken cross-link.

Suggested change
See [Context-Free Grammars](theory/cfg.md) for how terminals fit into production rules.
See [Context-Free Grammars](cfg.md) for how terminals fit into production rules.

Copilot uses AI. Check for mistakes.
// result: Double | Null = 11.0
```

`CalcLexer.tokenize` handles stages 1–2: it takes the source string and produces a `List[Lexeme]`. `CalcParser.parse` handles stages 3–4: it takes those lexemes, builds the parse tree internally, and returns the typed result.
Copy link

Copilot AI Mar 3, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CalcParser.parse does not build a parse tree internally; the runtime performs shift/reduce and applies semantic actions directly (see src/alpaca/internal/parser/Parser.scala, where reductions call actionTable(prod)(ctx, children) and no tree structure is retained). This sentence should be reworded to avoid implying a concrete internal parse tree data structure.

Suggested change
`CalcLexer.tokenize` handles stages 1–2: it takes the source string and produces a `List[Lexeme]`. `CalcParser.parse` handles stages 3–4: it takes those lexemes, builds the parse tree internally, and returns the typed result.
`CalcLexer.tokenize` handles stages 1–2: it takes the source string and produces a `List[Lexeme]`. `CalcParser.parse` handles stages 3–4: it consumes those lexemes using the generated LR(1) parse table and your semantic actions to compute the typed result, without constructing an explicit parse tree data structure.

Copilot uses AI. Check for mistakes.
Comment on lines +82 to +83
- Next: [Tokens & Lexemes](theory/tokens.md) — what the lexer produces: token classes, token instances, and how they are represented in Alpaca
- [The Lexer: Regex to Finite Automata](theory/lexer-fa.md) — how regular expressions define token classes and how Alpaca compiles them
Copy link

Copilot AI Mar 3, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These cross-links use theory/... paths even though this file already lives in docs/_docs/theory/. As written, they resolve to .../theory/theory/... and break. They should link to tokens.md and lexer-fa.md (same directory).

Suggested change
- Next: [Tokens & Lexemes](theory/tokens.md) — what the lexer produces: token classes, token instances, and how they are represented in Alpaca
- [The Lexer: Regex to Finite Automata](theory/lexer-fa.md) — how regular expressions define token classes and how Alpaca compiles them
- Next: [Tokens & Lexemes](tokens.md) — what the lexer produces: token classes, token instances, and how they are represented in Alpaca
- [The Lexer: Regex to Finite Automata](lexer-fa.md) — how regular expressions define token classes and how Alpaca compiles them

Copilot uses AI. Check for mistakes.
Comment on lines +82 to +83
This means Alpaca's lexer runs with the same O(n) guarantee as a hand-built DFA: one pass
through the input, no backtracking.
Copy link

Copilot AI Mar 3, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The claim that using Java regex implies an O(n) lexer with "no backtracking" is not accurate. Tokenization.tokenize() uses java.util.regex.Pattern/Matcher.lookingAt(), and Java's regex engine can backtrack with super-linear (even exponential) worst-cases depending on patterns; additionally, ctx.text = ctx.text.from(m.end) will create new subsequences for String inputs. Please soften this to an implementation description (combined ordered regex) without promising DFA-like complexity guarantees unless you can justify them for all supported patterns.

Suggested change
This means Alpaca's lexer runs with the same O(n) guarantee as a hand-built DFA: one pass
through the input, no backtracking.
In practice, this means Alpaca's lexer uses a single pre-compiled combined regex and scans
through the input from left to right, matching at each position with `lookingAt()` and using
the named capturing groups to determine the token class; the exact performance and any
backtracking behavior are determined by the Java regex engine and the specific token patterns.

Copilot uses AI. Check for mistakes.
|-----------|-------------------|-------------|-------|
| LR(0) | None (reduce always) | Smallest | Too weak for most real grammars |
| SLR(1) | FOLLOW sets (global per non-terminal) | Same as LR(0) | Better, still limited |
| LALR(1) | Per-state lookahead (merged item-set cores) | Same as LR(0)/SLR | Most common in practice (yacc, Bison, ANTLR) |
Copy link

Copilot AI Mar 3, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ANTLR is not an LALR(1) parser generator; it uses an LL(*) / adaptive prediction approach. Listing it as an example of LALR(1) here is misleading—please remove ANTLR from the LALR(1) examples or replace it with an actual LALR(1) tool.

Suggested change
| LALR(1) | Per-state lookahead (merged item-set cores) | Same as LR(0)/SLR | Most common in practice (yacc, Bison, ANTLR) |
| LALR(1) | Per-state lookahead (merged item-set cores) | Same as LR(0)/SLR | Most common in practice (yacc, Bison) |

Copilot uses AI. Check for mistakes.
@github-actions github-actions bot added the documentation Improvements or additions to documentation label Mar 3, 2026
@github-actions
Copy link
Copy Markdown

github-actions bot commented Mar 3, 2026

📊 Test Compilation Benchmark

Branch Average Time
Base (master) 55.998s
Current (theory-application) 49.307s

Result: Current branch is 6.691s faster (11.95%) ✅

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

documentation Improvements or additions to documentation

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants