Add compiler theory foundation pages (pipeline, tokens, lexer/FA)#261
halotukozak wants to merge 3 commits into `master` from
Conversation
- Create docs/_docs/theory/tokens.md (TH-02)
  - Terminal symbols definition, token class vs instance distinction
  - Formal lexeme definition as triple (T, w, pos) for CROSS-02
  - CalcLexer 7-token class table with patterns and value types
  - Canonical CalcLexer definition with sc:nocompile
  - Tokenization output code example with sc:nocompile
  - Cross-links to lexer.md and lexer-fa.md
- New docs/_docs/theory/ directory with pipeline.md as opening theory page
  - Explains four compilation stages: source text, lexical analysis, syntactic analysis, semantic analysis
  - Documents Alpaca's compile-time vs runtime boundary with standard callout block
  - Formal definition block using Unicode math: parse ∘ tokenize : String → R
  - Pipeline code example with sc:nocompile referencing CalcLexer and CalcParser
  - Type mapping table: pipeline stages to Alpaca types (List[Lexeme], R | Null)
  - Cross-links to lexer.md and parser.md
  - No CalcParser grammar notation, no LaTeX, all macro blocks marked sc:nocompile
- Create docs/_docs/theory/lexer-fa.md (TH-03)
  - Regular language formal definition block for CROSS-02
  - NFA/DFA conceptual explanation with state transition table for PLUS token
  - DFA 5-tuple formal definition (Q, Σ, δ, q₀, F) for CROSS-02
  - Combined alternation pattern explanation grounded in Alpaca internals
  - Shadowing detection via dregex subset checking
  - Standard compile-time callout block
  - Cross-links to lexer.md and tokens.md
Pull request overview
Adds three new “Compiler Theory” documentation pages to explain Alpaca’s compilation pipeline, token/lexeme vocabulary, and how lexing relates to regex/finite automata.
Changes:
- Introduces a pipeline overview page with compile-time vs runtime boundary and a formal composition definition.
- Adds a tokens/lexemes page defining token classes vs instances and a canonical `CalcLexer` token table/definition.
- Adds a lexer/FA page covering regular languages, DFA definition, pattern combination, and shadowing detection.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 18 comments.
| File | Description |
|---|---|
| docs/_docs/theory/pipeline.md | New theory page describing the end-to-end pipeline and compile-time/runtime split. |
| docs/_docs/theory/tokens.md | New theory page defining tokens/lexemes and documenting CalcLexer’s token set. |
| docs/_docs/theory/lexer-fa.md | New theory page explaining regex→FA concepts and how Alpaca tokenization/shadowing works. |
> `Tokenization` object. See [The Lexer: Regex to Finite Automata](lexer-fa.md) for what the
> macro does internally.
>
> ## Cross-links
>
> - See [Lexer](../lexer.md) for the full `lexer` DSL reference and all token forms.
> - See [The Lexer: Regex to Finite Automata](lexer-fa.md) for how regex patterns define token

This link points at the Markdown source (lexer-fa.md), but the generated documentation site uses .html pages (as in the rest of docs/_docs). Consider linking to lexer-fa.html so the cross-link works in the rendered site.

Suggested change:

```diff
-`Tokenization` object. See [The Lexer: Regex to Finite Automata](lexer-fa.md) for what the
+`Tokenization` object. See [The Lexer: Regex to Finite Automata](lexer-fa.html) for what the
 macro does internally.

 ## Cross-links

 - See [Lexer](../lexer.md) for the full `lexer` DSL reference and all token forms.
-- See [The Lexer: Regex to Finite Automata](lexer-fa.md) for how regex patterns define token
+- See [The Lexer: Regex to Finite Automata](lexer-fa.html) for how regex patterns define token
```
> ## Cross-links
>
> - See [Lexer](../lexer.md) for the full `lexer` DSL reference and all token forms.

For consistency with the rest of the docs (which link to *.html pages), this cross-link should likely be ../lexer.html rather than ../lexer.md so it works in the generated site.

Suggested change:

```diff
-- See [Lexer](../lexer.md) for the full `lexer` DSL reference and all token forms.
+- See [Lexer](../lexer.html) for the full `lexer` DSL reference and all token forms.
```
> ## Cross-links
>
> - See [Lexer](../lexer.md) for the complete `lexer` DSL reference.
> - See [Tokens and Lexemes](tokens.md) for what the lexer produces — the lexeme stream.

This link points at tokens.md, but the generated site uses .html pages. Consider linking to tokens.html for a working cross-link in the rendered docs.

Suggested change:

```diff
-- See [Tokens and Lexemes](tokens.md) for what the lexer produces — the lexeme stream.
+- See [Tokens and Lexemes](tokens.html) for what the lexer produces — the lexeme stream.
```
> In Alpaca, each matched token is represented as a `Lexeme[Name, Value]`. A lexeme carries four
> pieces of information:
>
> - `name` — the token class name string, e.g., `"NUMBER"` or `"PLUS"`
> - `value` — the extracted value with its Scala type, e.g., `3.14: Double` for NUMBER, `(): Unit`
>   for PLUS

This section says a lexeme carries “four pieces of information”, but Alpaca lexemes also include the matched string as lexeme.text (and more generally expose all lexer context fields via dynamic selection). Consider updating this description to include text / clarify that additional fields come from the lexer context.

Suggested change:

```diff
-In Alpaca, each matched token is represented as a `Lexeme[Name, Value]`. A lexeme carries four
-pieces of information:
+In Alpaca, each matched token is represented as a `Lexeme[Name, Value]`. A lexeme carries five
+core pieces of information:

 - `name` — the token class name string, e.g., `"NUMBER"` or `"PLUS"`
 - `value` — the extracted value with its Scala type, e.g., `3.14: Double` for NUMBER, `(): Unit`
   for PLUS
+- `text` — the matched source substring (also available as `lexeme.text`)
```
> A practical issue with ordered alternation is *shadowing*: pattern A shadows pattern B if every
> string matched by B is also matched by A (that is, L(B) ⊆ L(A), meaning every string in B's
> language is also in A's language), and A appears before B in the lexer definition. If this
> occurs, B will never match — it is dead code.
>
> Alpaca's `RegexChecker` uses the `dregex` library (a Scala/JVM library for decidable regex
> operations) to check at compile time whether any pattern's language is a subset of an earlier
> pattern's language. If shadowing is detected, the macro throws a `ShadowException` with a
> compile error pointing to the offending patterns.
>
> **Example:** If you wrote the integer pattern `"[0-9]+"` before the decimal pattern
> `"[0-9]+(\\.[0-9]+)?"`, the integer pattern would shadow the decimal one — every decimal like
> `"3.14"` is also matched by `"[0-9]+"` up to the decimal point, but more critically the integer
> pattern can match the prefix `"3"` and would consume it first. The `dregex` check catches this

The shadowing definition here uses plain language subset L(B) ⊆ L(A), but Alpaca’s implementation checks subset on pattern + ".*" (prefix languages) because matching uses lookingAt and doesn’t require a full-string match. Consider updating the definition to reflect this prefix-based notion so the described behavior matches the actual checker.

Suggested change:

```diff
-A practical issue with ordered alternation is *shadowing*: pattern A shadows pattern B if every
-string matched by B is also matched by A (that is, L(B) ⊆ L(A), meaning every string in B's
-language is also in A's language), and A appears before B in the lexer definition. If this
-occurs, B will never match — it is dead code.
+A practical issue with ordered alternation is *shadowing*. Intuitively, pattern A shadows pattern
+B if, whenever B could match starting at some input position, A can also match some (possibly
+shorter) prefix there, and A appears before B in the lexer definition. In that situation B will
+never be the pattern that the lexer chooses — it is effectively dead code.

-Alpaca's `RegexChecker` uses the `dregex` library (a Scala/JVM library for decidable regex
-operations) to check at compile time whether any pattern's language is a subset of an earlier
-pattern's language. If shadowing is detected, the macro throws a `ShadowException` with a
-compile error pointing to the offending patterns.
+Formally, because Alpaca's lexer uses `matcher.lookingAt()`, matching is prefix-based rather than
+whole-string based. Alpaca's `RegexChecker` therefore uses the `dregex` library (a Scala/JVM
+library for decidable regex operations) to check at compile time whether the *prefix-extended*
+language of a later pattern (conceptually its language with `".*"` appended) is a subset of the
+prefix-extended language of an earlier pattern. If such a subset relation holds, the macro throws
+a `ShadowException` with a compile error pointing to the offending patterns.

 **Example:** If you wrote the integer pattern `"[0-9]+"` before the decimal pattern
 `"[0-9]+(\\.[0-9]+)?"`, the integer pattern would shadow the decimal one — every decimal like
-`"3.14"` is also matched by `"[0-9]+"` up to the decimal point, but more critically the integer
-pattern can match the prefix `"3"` and would consume it first. The `dregex` check catches this
+`"3.14"` is also matched by `"[0-9]+"` on its initial digits, and the integer pattern can match
+the prefix `"3"` and would consume it first. The prefix-based `dregex` check catches this
```
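The prefix-based matching behavior this comment relies on can be demonstrated directly with `java.util.regex`, independently of Alpaca. This standalone sketch uses the exact integer/decimal patterns from the example: `lookingAt()` anchors at the start but does not require consuming the whole input, so the integer pattern happily eats the prefix `"3"` of `"3.14"`.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ShadowingDemo {
    public static void main(String[] args) {
        // Integer pattern listed first, decimal pattern second: the shadowing order.
        Pattern integer = Pattern.compile("[0-9]+");
        Pattern decimal = Pattern.compile("[0-9]+(\\.[0-9]+)?");

        String input = "3.14";

        // lookingAt() matches a prefix, not the whole string, so the integer
        // pattern wins first and consumes only "3".
        Matcher i = integer.matcher(input);
        if (i.lookingAt()) {
            System.out.println(i.group()); // prints "3"
        }

        // The decimal pattern would have matched the full "3.14" at the same
        // position, but an ordered lexer never gives it the chance.
        Matcher d = decimal.matcher(input);
        if (d.lookingAt()) {
            System.out.println(d.group()); // prints "3.14"
        }
    }
}
```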
> Alpaca follows the same principle but implements it using Java's regex engine, which is itself
> backed by NFA/DFA machinery:

Java’s java.util.regex.Pattern is a backtracking regex engine; it’s not a DFA execution model and can have super-linear worst-case behavior for certain patterns. Consider rephrasing this paragraph to avoid implying DFA semantics/guarantees from using Java regexes.
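The backtracking behavior described in this comment is observable empirically. This standalone sketch (not part of Alpaca) wraps the input in a `CharSequence` that counts character reads: a true DFA would read each character roughly once per pass, while Java's engine re-reads positions many times on a classic catastrophic pattern.

```java
import java.util.regex.Pattern;

public class BacktrackDemo {
    // CharSequence wrapper that counts how often the regex engine reads a char.
    static final class CountingSeq implements CharSequence {
        final String s;
        long reads = 0;
        CountingSeq(String s) { this.s = s; }
        public int length() { return s.length(); }
        public char charAt(int i) { reads++; return s.charAt(i); }
        public CharSequence subSequence(int a, int b) { return s.subSequence(a, b); }
    }

    public static void main(String[] args) {
        // Classic catastrophic-backtracking shape: nested quantifiers, no match.
        CountingSeq seq = new CountingSeq("aaaaaaaaaaaaaaaaaa" + "c"); // 18 'a's, then 'c'
        boolean matched = Pattern.compile("(a+)+b").matcher(seq).matches();
        System.out.println(matched); // false: the required 'b' never appears
        // A DFA would need ~19 reads; the backtracking engine revisits positions
        // exponentially many times while trying every way to split the 'a' run.
        System.out.println(seq.reads > 1000); // true
    }
}
```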
> that appear in source text. In a lexer, each token class acts as a terminal: it names a category
> of strings, and no lexer-level expansion applies below it.
>
> See [Context-Free Grammars](theory/cfg.md) for how terminals fit into production rules.

The link target theory/cfg.md is both an incorrect relative path from this page (it would resolve to theory/theory/cfg.md) and there is no cfg.md page under docs/_docs/theory/. Add the missing CFG page or update/remove this link so it resolves correctly.

Suggested change:

```diff
-See [Context-Free Grammars](theory/cfg.md) for how terminals fit into production rules.
+See the discussion of context-free grammars for how terminals fit into production rules.
```
> The consequence: if your regex is invalid, or your grammar is ambiguous, you get a compile error — not a runtime crash. The pipeline is safe by construction before it ever runs on real input.
>
> Alpaca covers stages 1–3 of the classical pipeline. The "code generation" stage is not part of the library — your Scala semantic actions in the parser rules produce the final typed value directly.

This statement contradicts earlier text on the page that CalcParser.parse “handles stages 3–4” and that Alpaca “stops at stage 4”. If semantic analysis/evaluation happens via parser semantic actions, Alpaca effectively covers stage 4 as well; please reconcile the stage numbering here for consistency.

Suggested change:

```diff
-Alpaca covers stages 1–3 of the classical pipeline. The "code generation" stage is not part of the library — your Scala semantic actions in the parser rules produce the final typed value directly.
+Alpaca covers stages 1–4 of the classical pipeline. The "code generation" stage is not part of the library — your Scala semantic actions in the parser rules produce the final typed value directly.
```
> - See [Lexer](../lexer.md) for the complete `lexer` DSL reference.
> - See [Tokens and Lexemes](tokens.md) for what the lexer produces — the lexeme stream.
> - Next: [Context-Free Grammars](theory/cfg.md) for how token streams are parsed.

The link theory/cfg.md is broken from this page (it would resolve to theory/theory/cfg.md), and there is no CFG page under docs/_docs/theory/. Add the missing page or update this cross-link to an existing target.

Suggested change:

```diff
-- Next: [Context-Free Grammars](theory/cfg.md) for how token streams are parsed.
+- Next: [Context-Free Grammars](cfg.md) for how token streams are parsed.
```
> This means Alpaca's lexer runs with the same O(n) guarantee as a hand-built DFA: one pass
> through the input, no backtracking.

Saying this “means … O(n) … no backtracking” is not accurate for Java regex in general. It would be safer to describe the implementation (combined pattern + lookingAt) without promising DFA-like time guarantees.

Suggested change:

```diff
-This means Alpaca's lexer runs with the same O(n) guarantee as a hand-built DFA: one pass
-through the input, no backtracking.
+In practice, this combined-pattern approach lets Alpaca's lexer scan the input from left to
+right in a single pass, much like a hand-built DFA-based lexer. However, it still relies on
+Java's backtracking regex engine internally, so Alpaca does not claim a strict worst-case O(n)
+time guarantee or the complete absence of backtracking for arbitrary token patterns.
```
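The combined-pattern-plus-`lookingAt` scanning strategy discussed here can be sketched as a tiny standalone tokenizer in plain `java.util.regex`. The token names and patterns below are illustrative, not Alpaca's API: one alternation with a named group per token class, advanced left to right with `region` + `lookingAt` in a single pass.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CombinedPatternLexer {
    // One combined alternation with a named group per token class. On ties the
    // first alternative wins, which is why pattern order (and shadow checking)
    // matters in an ordered lexer.
    static final Pattern TOKENS = Pattern.compile(
        "(?<NUMBER>[0-9]+(\\.[0-9]+)?)|(?<PLUS>\\+)|(?<LPAREN>\\()|(?<RPAREN>\\))|(?<WS>\\s+)");

    static final String[] NAMES = {"NUMBER", "PLUS", "LPAREN", "RPAREN", "WS"};

    public static void main(String[] args) {
        String input = "(1.5 + 2)";
        List<String> lexemes = new ArrayList<>();
        Matcher m = TOKENS.matcher(input);
        int pos = 0;
        while (pos < input.length()) {
            m.region(pos, input.length());       // anchor the next match at pos
            if (!m.lookingAt()) {
                throw new IllegalStateException("no token at offset " + pos);
            }
            for (String name : NAMES) {          // find which alternative matched
                if (m.group(name) != null) {
                    if (!name.equals("WS")) {    // skip whitespace tokens
                        lexemes.add(name + "(" + m.group(name) + ")");
                    }
                    break;
                }
            }
            pos = m.end();                        // advance past the match: one pass
        }
        System.out.println(lexemes);
    }
}
```

Note the single forward sweep: `pos` only ever increases, even though the engine underneath may still backtrack while deciding each individual match.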
Summary
- Formal definition `parse ∘ tokenize : String → R`; lexeme triple `(T, w, pos)`; canonical CalcLexer definition with all 7 tokens (NUMBER/PLUS/MINUS/TIMES/DIVIDE/LPAREN/RPAREN)
- The `lexer` macro compiles regex at compile time; shadow detection via dregex
- All three pages include formal definition blocks, compile-time processing callouts, and cross-links to the corresponding Alpaca reference docs (lexer.md, parser.md).
Part of the v1.1 Compiler Theory Tutorial milestone — Phase 8: Theory Foundation (TH-01, TH-02, TH-03).
Test plan
- `./mill docJar` passes (all examples compile — macro blocks use `sc:nocompile`)
- `theory/pipeline.md` contains the `> **Compile-time processing:**` callout and formal definition
- `theory/tokens.md` contains the lexeme triple definition and full CalcLexer 7-token table
- `theory/lexer-fa.md` contains the DFA 5-tuple definition and NFA/DFA conceptual section
- No LaTeX (no `$`), no `extends Parser` grammar leakage, no `sc:compile` on macro blocks

🤖 Generated with Claude Code