Add compiler theory foundation pages (pipeline, tokens, lexer/FA) #261

Open

halotukozak wants to merge 3 commits into master from theory-foundation

Conversation

@halotukozak
Owner

Summary

  • theory/pipeline.md — The Compilation Pipeline: source → tokens → parse tree → typed result, Alpaca's compile-time macro vs runtime split, formal function-composition definition parse ∘ tokenize : String → R
  • theory/tokens.md — Tokens & Lexemes: terminal symbols, token classes, lexeme triple (T, w, pos), CalcLexer canonical definition with all 7 tokens (NUMBER/PLUS/MINUS/TIMES/DIVIDE/LPAREN/RPAREN)
  • theory/lexer-fa.md — The Lexer: Regex to Finite Automata: regular languages, NFA → DFA concept, DFA 5-tuple formal definition, how Alpaca's lexer macro compiles regex at compile time, shadow detection via dregex

All three pages include formal definition blocks, compile-time processing callouts, and cross-links to the corresponding Alpaca reference docs (lexer.md, parser.md).

Part of the v1.1 Compiler Theory Tutorial milestone — Phase 8: Theory Foundation (TH-01, TH-02, TH-03).
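The parse ∘ tokenize composition named in the summary can be sketched with plain Scala stand-ins. This is a toy two-token calculator for intuition only; `Lexeme`, `tokenize`, and `parse` here are hypothetical simplifications, not Alpaca's actual API.

```scala
// Toy model of the pipeline: tokenize : String => List[Lexeme],
// parse : List[Lexeme] => R, composed as parse ∘ tokenize.
case class Lexeme(name: String, value: Double | Unit)

// Whitespace-separated scanner for just NUMBER and PLUS, for illustration.
def tokenize(src: String): List[Lexeme] =
  src.split("\\s+").toList.filter(_.nonEmpty).map {
    case "+" => Lexeme("PLUS", ())
    case n   => Lexeme("NUMBER", n.toDouble)
  }

// Stand-in "parser" for NUMBER (PLUS NUMBER)*: folds the stream into a sum.
def parse(lexemes: List[Lexeme]): Double =
  lexemes.collect { case Lexeme("NUMBER", v: Double) => v }.sum

// parse ∘ tokenize : String => R (here R = Double)
val run: String => Double = s => parse(tokenize(s))
```

In the real library both stages are macro-generated and type-checked at compile time; the point of the sketch is only the shape of the composition.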

Test plan

  • ./mill docJar passes (all examples compile — macro blocks use sc:nocompile)
  • theory/pipeline.md contains > **Compile-time processing:** callout and formal definition
  • theory/tokens.md contains lexeme triple definition and full CalcLexer 7-token table
  • theory/lexer-fa.md contains DFA 5-tuple definition and NFA/DFA conceptual section
  • No LaTeX (`$`), no `extends Parser` grammar leakage, no sc:compile on macro blocks

🤖 Generated with Claude Code

- Create docs/_docs/theory/tokens.md (TH-02)
  - Terminal symbols definition, token class vs instance distinction
  - Formal lexeme definition as triple (T, w, pos) for CROSS-02
  - CalcLexer 7-token class table with patterns and value types
  - Canonical CalcLexer definition with sc:nocompile
  - Tokenization output code example with sc:nocompile
  - Cross-links to lexer.md and lexer-fa.md
- New docs/_docs/theory/ directory with pipeline.md as opening theory page
  - Explains four compilation stages: source text, lexical analysis, syntactic analysis, semantic analysis
  - Documents Alpaca's compile-time vs runtime boundary with standard callout block
  - Formal definition block using Unicode math: parse ∘ tokenize : String → R
  - Pipeline code example with sc:nocompile referencing CalcLexer and CalcParser
  - Type mapping table: pipeline stages to Alpaca types (List[Lexeme], R | Null)
  - Cross-links to lexer.md and parser.md
  - No CalcParser grammar notation, no LaTeX, all macro blocks marked sc:nocompile
- Create docs/_docs/theory/lexer-fa.md (TH-03)
  - Regular language formal definition block for CROSS-02
  - NFA/DFA conceptual explanation with state transition table for PLUS token
  - DFA 5-tuple formal definition (Q, Σ, δ, q₀, F) for CROSS-02
  - Combined alternation pattern explanation grounded in Alpaca internals
  - Shadowing detection via dregex subset checking
  - Standard compile-time callout block
  - Cross-links to lexer.md and tokens.md
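The DFA 5-tuple (Q, Σ, δ, q₀, F) listed above can be made concrete with a hand-written automaton for the NUMBER-style pattern `[0-9]+(\.[0-9]+)?`. This is a sketch for intuition, not the code Alpaca generates:

```scala
// Hand-written DFA for [0-9]+(\.[0-9]+)? as a concrete (Q, Σ, δ, q₀, F):
//   Q  = {0, 1, 2, 3}   0 = start (q₀), 1 = integer part, 2 = after '.', 3 = fraction
//   F  = {1, 3}         accepting: at least one digit, no dangling '.'
//   δ  = the partial transition function below (None = reject)
def delta(q: Int, c: Char): Option[Int] = (q, c) match {
  case (0, d) if d.isDigit => Some(1)
  case (1, d) if d.isDigit => Some(1)
  case (1, '.')            => Some(2)
  case (2, d) if d.isDigit => Some(3)
  case (3, d) if d.isDigit => Some(3)
  case _                   => None
}

val accepting = Set(1, 3)

// Run the DFA: exactly one state transition per character, a single pass.
def accepts(s: String): Boolean =
  s.foldLeft(Option(0))((state, c) => state.flatMap(delta(_, c))).exists(accepting)
```

Strings like `"3."` are rejected because the run ends in the non-accepting state 2, which is exactly what the optional fraction group in the regex encodes.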
Copilot AI review requested due to automatic review settings February 20, 2026 15:01
@github-actions github-actions bot added the `documentation` (Improvements or additions to documentation) and `thesis` labels Feb 20, 2026
Contributor

Copilot AI left a comment


Pull request overview

Adds three new “Compiler Theory” documentation pages to explain Alpaca’s compilation pipeline, token/lexeme vocabulary, and how lexing relates to regex/finite automata.

Changes:

  • Introduces a pipeline overview page with compile-time vs runtime boundary and a formal composition definition.
  • Adds a tokens/lexemes page defining token classes vs instances and a canonical CalcLexer token table/definition.
  • Adds a lexer/FA page covering regular languages, DFA definition, pattern combination, and shadowing detection.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 18 comments.

| File | Description |
| --- | --- |
| docs/_docs/theory/pipeline.md | New theory page describing the end-to-end pipeline and compile-time/runtime split. |
| docs/_docs/theory/tokens.md | New theory page defining tokens/lexemes and documenting CalcLexer's token set. |
| docs/_docs/theory/lexer-fa.md | New theory page explaining regex→FA concepts and how Alpaca tokenization/shadowing works. |


Comment on lines +116 to +122
`Tokenization` object. See [The Lexer: Regex to Finite Automata](lexer-fa.md) for what the
macro does internally.

## Cross-links

- See [Lexer](../lexer.md) for the full `lexer` DSL reference and all token forms.
- See [The Lexer: Regex to Finite Automata](lexer-fa.md) for how regex patterns define token

Copilot AI Feb 20, 2026


This link points at the Markdown source (lexer-fa.md), but the generated documentation site uses .html pages (as in the rest of docs/_docs). Consider linking to lexer-fa.html so the cross-link works in the rendered site.

Suggested change

Current:
`Tokenization` object. See [The Lexer: Regex to Finite Automata](lexer-fa.md) for what the
macro does internally.
## Cross-links
- See [Lexer](../lexer.md) for the full `lexer` DSL reference and all token forms.
- See [The Lexer: Regex to Finite Automata](lexer-fa.md) for how regex patterns define token

Proposed:
`Tokenization` object. See [The Lexer: Regex to Finite Automata](lexer-fa.html) for what the
macro does internally.
## Cross-links
- See [Lexer](../lexer.md) for the full `lexer` DSL reference and all token forms.
- See [The Lexer: Regex to Finite Automata](lexer-fa.html) for how regex patterns define token


## Cross-links

- See [Lexer](../lexer.md) for the full `lexer` DSL reference and all token forms.

Copilot AI Feb 20, 2026


For consistency with the rest of the docs (which link to *.html pages), this cross-link should likely be ../lexer.html rather than ../lexer.md so it works in the generated site.

Suggested change

Current:
- See [Lexer](../lexer.md) for the full `lexer` DSL reference and all token forms.

Proposed:
- See [Lexer](../lexer.html) for the full `lexer` DSL reference and all token forms.

## Cross-links

- See [Lexer](../lexer.md) for the complete `lexer` DSL reference.
- See [Tokens and Lexemes](tokens.md) for what the lexer produces — the lexeme stream.

Copilot AI Feb 20, 2026


This link points at tokens.md, but the generated site uses .html pages. Consider linking to tokens.html for a working cross-link in the rendered docs.

Suggested change

Current:
- See [Tokens and Lexemes](tokens.md) for what the lexer produces — the lexeme stream.

Proposed:
- See [Tokens and Lexemes](tokens.html) for what the lexer produces — the lexeme stream.

Comment on lines +46 to +51
In Alpaca, each matched token is represented as a `Lexeme[Name, Value]`. A lexeme carries four
pieces of information:

- `name` — the token class name string, e.g., `"NUMBER"` or `"PLUS"`
- `value` — the extracted value with its Scala type, e.g., `3.14: Double` for NUMBER, `(): Unit`
for PLUS

Copilot AI Feb 20, 2026


This section says a lexeme carries “four pieces of information”, but Alpaca lexemes also include the matched string as lexeme.text (and more generally expose all lexer context fields via dynamic selection). Consider updating this description to include text / clarify that additional fields come from the lexer context.

Suggested change

Current:
In Alpaca, each matched token is represented as a `Lexeme[Name, Value]`. A lexeme carries four
pieces of information:
- `name` — the token class name string, e.g., `"NUMBER"` or `"PLUS"`
- `value` — the extracted value with its Scala type, e.g., `3.14: Double` for NUMBER, `(): Unit`
for PLUS

Proposed:
In Alpaca, each matched token is represented as a `Lexeme[Name, Value]`. A lexeme carries five
core pieces of information:
- `name` — the token class name string, e.g., `"NUMBER"` or `"PLUS"`
- `value` — the extracted value with its Scala type, e.g., `3.14: Double` for NUMBER, `(): Unit`
for PLUS
- `text` — the matched source substring (also available as `lexeme.text`)

Comment on lines +87 to +100
A practical issue with ordered alternation is *shadowing*: pattern A shadows pattern B if every
string matched by B is also matched by A (that is, L(B) ⊆ L(A), meaning every string in B's
language is also in A's language), and A appears before B in the lexer definition. If this
occurs, B will never match — it is dead code.

Alpaca's `RegexChecker` uses the `dregex` library (a Scala/JVM library for decidable regex
operations) to check at compile time whether any pattern's language is a subset of an earlier
pattern's language. If shadowing is detected, the macro throws a `ShadowException` with a
compile error pointing to the offending patterns.

**Example:** If you wrote the integer pattern `"[0-9]+"` before the decimal pattern
`"[0-9]+(\\.[0-9]+)?"`, the integer pattern would shadow the decimal one — every decimal like
`"3.14"` is also matched by `"[0-9]+"` up to the decimal point, but more critically the integer
pattern can match the prefix `"3"` and would consume it first. The `dregex` check catches this

Copilot AI Feb 20, 2026


The shadowing definition here uses plain language subset L(B) ⊆ L(A), but Alpaca’s implementation checks subset on pattern + ".*" (prefix languages) because matching uses lookingAt and doesn’t require a full-string match. Consider updating the definition to reflect this prefix-based notion so the described behavior matches the actual checker.

Suggested change

Current:
A practical issue with ordered alternation is *shadowing*: pattern A shadows pattern B if every
string matched by B is also matched by A (that is, L(B) ⊆ L(A), meaning every string in B's
language is also in A's language), and A appears before B in the lexer definition. If this
occurs, B will never match — it is dead code.
Alpaca's `RegexChecker` uses the `dregex` library (a Scala/JVM library for decidable regex
operations) to check at compile time whether any pattern's language is a subset of an earlier
pattern's language. If shadowing is detected, the macro throws a `ShadowException` with a
compile error pointing to the offending patterns.
**Example:** If you wrote the integer pattern `"[0-9]+"` before the decimal pattern
`"[0-9]+(\\.[0-9]+)?"`, the integer pattern would shadow the decimal one — every decimal like
`"3.14"` is also matched by `"[0-9]+"` up to the decimal point, but more critically the integer
pattern can match the prefix `"3"` and would consume it first. The `dregex` check catches this

Proposed:
A practical issue with ordered alternation is *shadowing*. Intuitively, pattern A shadows pattern
B if, whenever B could match starting at some input position, A can also match some (possibly
shorter) prefix there, and A appears before B in the lexer definition. In that situation B will
never be the pattern that the lexer chooses — it is effectively dead code.
Formally, because Alpaca's lexer uses `matcher.lookingAt()`, matching is prefix-based rather than
whole-string based. Alpaca's `RegexChecker` therefore uses the `dregex` library (a Scala/JVM
library for decidable regex operations) to check at compile time whether the *prefix-extended*
language of a later pattern (conceptually its language with `".*"` appended) is a subset of the
prefix-extended language of an earlier pattern. If such a subset relation holds, the macro throws
a `ShadowException` with a compile error pointing to the offending patterns.
**Example:** If you wrote the integer pattern `"[0-9]+"` before the decimal pattern
`"[0-9]+(\\.[0-9]+)?"`, the integer pattern would shadow the decimal one — every decimal like
`"3.14"` is also matched by `"[0-9]+"` on its initial digits, and the integer pattern can match
the prefix `"3"` and would consume it first. The prefix-based `dregex` check catches this

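The prefix-based behavior this comment describes is easy to reproduce with `java.util.regex` directly. The following is an illustration of ordered alternation with `lookingAt`, not Alpaca's `RegexChecker`; the token names are hypothetical.

```scala
import java.util.regex.Pattern

// First-match-wins over an ordered pattern list, using lookingAt (prefix match
// at offset 0), mirroring the shadowing scenario in the review comment.
def firstMatch(patterns: List[(String, String)], input: String): Option[(String, String)] =
  patterns.view
    .map { (name, regex) =>
      val m = Pattern.compile(regex).matcher(input)
      if (m.lookingAt()) Some(name -> m.group()) else None
    }
    .collectFirst { case Some(hit) => hit }

// With INT listed first, DECIMAL is shadowed: on "3.14", INT grabs the prefix "3".
val shadowing = List("INT" -> "[0-9]+", "DECIMAL" -> "[0-9]+(\\.[0-9]+)?")
```

Reversing the list order lets the decimal pattern match the whole literal `"3.14"`, which is why a subset check on prefix-extended languages, rather than whole-string languages, is the right compile-time test.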
Comment on lines +66 to +68
Alpaca follows the same principle but implements it using Java's regex engine, which is itself
backed by NFA/DFA machinery:


Copilot AI Feb 20, 2026


Java’s java.util.regex.Pattern is a backtracking regex engine; it’s not a DFA execution model and can have super-linear worst-case behavior for certain patterns. Consider rephrasing this paragraph to avoid implying DFA semantics/guarantees from using Java regexes.

that appear in source text. In a lexer, each token class acts as a terminal: it names a category
of strings, and no lexer-level expansion applies below it.

See [Context-Free Grammars](theory/cfg.md) for how terminals fit into production rules.

Copilot AI Feb 20, 2026


The link target theory/cfg.md is both an incorrect relative path from this page (it would resolve to theory/theory/cfg.md) and there is no cfg.md page under docs/_docs/theory/. Add the missing CFG page or update/remove this link so it resolves correctly.

Suggested change

Current:
See [Context-Free Grammars](theory/cfg.md) for how terminals fit into production rules.

Proposed:
See the discussion of context-free grammars for how terminals fit into production rules.


The consequence: if your regex is invalid, or your grammar is ambiguous, you get a compile error — not a runtime crash. The pipeline is safe by construction before it ever runs on real input.

Alpaca covers stages 1–3 of the classical pipeline. The "code generation" stage is not part of the library — your Scala semantic actions in the parser rules produce the final typed value directly.

Copilot AI Feb 20, 2026


This statement contradicts earlier text on the page that CalcParser.parse “handles stages 3–4” and that Alpaca “stops at stage 4”. If semantic analysis/evaluation happens via parser semantic actions, Alpaca effectively covers stage 4 as well; please reconcile the stage numbering here for consistency.

Suggested change

Current:
Alpaca covers stages 1–3 of the classical pipeline. The "code generation" stage is not part of the library — your Scala semantic actions in the parser rules produce the final typed value directly.

Proposed:
Alpaca covers stages 1–4 of the classical pipeline. The "code generation" stage is not part of the library — your Scala semantic actions in the parser rules produce the final typed value directly.


- See [Lexer](../lexer.md) for the complete `lexer` DSL reference.
- See [Tokens and Lexemes](tokens.md) for what the lexer produces — the lexeme stream.
- Next: [Context-Free Grammars](theory/cfg.md) for how token streams are parsed.

Copilot AI Feb 20, 2026


The link theory/cfg.md is broken from this page (it would resolve to theory/theory/cfg.md), and there is no CFG page under docs/_docs/theory/. Add the missing page or update this cross-link to an existing target.

Suggested change

Current:
- Next: [Context-Free Grammars](theory/cfg.md) for how token streams are parsed.

Proposed:
- Next: [Context-Free Grammars](cfg.md) for how token streams are parsed.

Comment on lines +82 to +83
This means Alpaca's lexer runs with the same O(n) guarantee as a hand-built DFA: one pass
through the input, no backtracking.

Copilot AI Feb 20, 2026


Saying this “means … O(n) … no backtracking” is not accurate for Java regex in general. It would be safer to describe the implementation (combined pattern + lookingAt) without promising DFA-like time guarantees.

Suggested change

Current:
This means Alpaca's lexer runs with the same O(n) guarantee as a hand-built DFA: one pass
through the input, no backtracking.

Proposed:
In practice, this combined-pattern approach lets Alpaca's lexer scan the input from left to
right in a single pass, much like a hand-built DFA-based lexer. However, it still relies on
Java's backtracking regex engine internally, so Alpaca does not claim a strict worst-case O(n)
time guarantee or the complete absence of backtracking for arbitrary token patterns.

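The combined-alternation scan under discussion can be sketched with named groups and `lookingAt`. This uses a hypothetical three-token set and is not the `Tokenization` object Alpaca generates:

```scala
import java.util.regex.Pattern

// All token regexes joined into one alternation with named groups; the scanner
// advances with lookingAt from each offset. Token names are hypothetical.
val combined   = Pattern.compile("(?<NUMBER>[0-9]+(\\.[0-9]+)?)|(?<PLUS>\\+)|(?<WS>[ \\t]+)")
val groupNames = List("NUMBER", "PLUS", "WS")

def scan(input: String): List[(String, String)] = {
  val matcher = combined.matcher(input)
  val out = List.newBuilder[(String, String)]
  var pos = 0
  while (pos < input.length) {
    matcher.region(pos, input.length)        // anchor lookingAt at the current offset
    if (!matcher.lookingAt()) sys.error(s"no token matches at offset $pos")
    val name = groupNames.find(n => matcher.group(n) != null).get
    if (name != "WS") out += name -> matcher.group() // whitespace is matched but dropped
    pos = matcher.end()
  }
  out.result()
}
```

As the review comment notes, this scans offsets left to right in a single pass, but each `lookingAt` call still runs on Java's backtracking engine, so it carries no strict O(n) guarantee.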
Base automatically changed from compile-time-callouts-and-error-catalog to master February 23, 2026 22:34