Add compiler theory foundation pages (pipeline, tokens, lexer/FA)#261
halotukozak wants to merge 3 commits into `master` from
Conversation
- Create docs/_docs/theory/tokens.md (TH-02)
  - Terminal symbols definition, token class vs instance distinction
  - Formal lexeme definition as triple (T, w, pos) for CROSS-02
  - CalcLexer 7-token class table with patterns and value types
  - Canonical CalcLexer definition with sc:nocompile
  - Tokenization output code example with sc:nocompile
  - Cross-links to lexer.md and lexer-fa.md
- New docs/_docs/theory/ directory with pipeline.md as opening theory page
  - Explains four compilation stages: source text, lexical analysis, syntactic analysis, semantic analysis
  - Documents Alpaca's compile-time vs runtime boundary with standard callout block
  - Formal definition block using Unicode math: parse ∘ tokenize : String → R
  - Pipeline code example with sc:nocompile referencing CalcLexer and CalcParser
  - Type mapping table: pipeline stages to Alpaca types (List[Lexeme], R | Null)
  - Cross-links to lexer.md and parser.md
  - No CalcParser grammar notation, no LaTeX, all macro blocks marked sc:nocompile
- Create docs/_docs/theory/lexer-fa.md (TH-03)
  - Regular language formal definition block for CROSS-02
  - NFA/DFA conceptual explanation with state transition table for PLUS token
  - DFA 5-tuple formal definition (Q, Σ, δ, q₀, F) for CROSS-02
  - Combined alternation pattern explanation grounded in Alpaca internals
  - Shadowing detection via dregex subset checking
  - Standard compile-time callout block
  - Cross-links to lexer.md and tokens.md
Pull request overview
Adds three new “Compiler Theory” documentation pages to explain Alpaca’s compilation pipeline, token/lexeme vocabulary, and how lexing relates to regex/finite automata.
Changes:
- Introduces a pipeline overview page with compile-time vs runtime boundary and a formal composition definition.
- Adds a tokens/lexemes page defining token classes vs instances and a canonical `CalcLexer` token table/definition.
- Adds a lexer/FA page covering regular languages, DFA definition, pattern combination, and shadowing detection.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 18 comments.
| File | Description |
|---|---|
| docs/_docs/theory/pipeline.md | New theory page describing the end-to-end pipeline and compile-time/runtime split. |
| docs/_docs/theory/tokens.md | New theory page defining tokens/lexemes and documenting CalcLexer’s token set. |
| docs/_docs/theory/lexer-fa.md | New theory page explaining regex→FA concepts and how Alpaca tokenization/shadowing works. |
> `Tokenization` object. See [The Lexer: Regex to Finite Automata](lexer-fa.md) for what the
> macro does internally.
>
> ## Cross-links
>
> - See [Lexer](../lexer.md) for the full `lexer` DSL reference and all token forms.
> - See [The Lexer: Regex to Finite Automata](lexer-fa.md) for how regex patterns define token

This link points at the Markdown source (lexer-fa.md), but the generated documentation site uses .html pages (as in the rest of docs/_docs). Consider linking to lexer-fa.html so the cross-link works in the rendered site.

Suggested change:

```diff
-`Tokenization` object. See [The Lexer: Regex to Finite Automata](lexer-fa.md) for what the
+`Tokenization` object. See [The Lexer: Regex to Finite Automata](lexer-fa.html) for what the
 macro does internally.

 ## Cross-links

 - See [Lexer](../lexer.md) for the full `lexer` DSL reference and all token forms.
-- See [The Lexer: Regex to Finite Automata](lexer-fa.md) for how regex patterns define token
+- See [The Lexer: Regex to Finite Automata](lexer-fa.html) for how regex patterns define token
```
> ## Cross-links
>
> - See [Lexer](../lexer.md) for the full `lexer` DSL reference and all token forms.

For consistency with the rest of the docs (which link to *.html pages), this cross-link should likely be ../lexer.html rather than ../lexer.md so it works in the generated site.

Suggested change:

```diff
-- See [Lexer](../lexer.md) for the full `lexer` DSL reference and all token forms.
+- See [Lexer](../lexer.html) for the full `lexer` DSL reference and all token forms.
```
> ## Cross-links
>
> - See [Lexer](../lexer.md) for the complete `lexer` DSL reference.
> - See [Tokens and Lexemes](tokens.md) for what the lexer produces — the lexeme stream.

This link points at tokens.md, but the generated site uses .html pages. Consider linking to tokens.html for a working cross-link in the rendered docs.

Suggested change:

```diff
-- See [Tokens and Lexemes](tokens.md) for what the lexer produces — the lexeme stream.
+- See [Tokens and Lexemes](tokens.html) for what the lexer produces — the lexeme stream.
```
> In Alpaca, each matched token is represented as a `Lexeme[Name, Value]`. A lexeme carries four
> pieces of information:
>
> - `name` — the token class name string, e.g., `"NUMBER"` or `"PLUS"`
> - `value` — the extracted value with its Scala type, e.g., `3.14: Double` for NUMBER, `(): Unit`
>   for PLUS

This section says a lexeme carries “four pieces of information”, but Alpaca lexemes also include the matched string as lexeme.text (and more generally expose all lexer context fields via dynamic selection). Consider updating this description to include text / clarify that additional fields come from the lexer context.

Suggested change:

```diff
-In Alpaca, each matched token is represented as a `Lexeme[Name, Value]`. A lexeme carries four
-pieces of information:
+In Alpaca, each matched token is represented as a `Lexeme[Name, Value]`. A lexeme carries five
+core pieces of information:

 - `name` — the token class name string, e.g., `"NUMBER"` or `"PLUS"`
 - `value` — the extracted value with its Scala type, e.g., `3.14: Double` for NUMBER, `(): Unit`
   for PLUS
+- `text` — the matched source substring (also available as `lexeme.text`)
```
> A practical issue with ordered alternation is *shadowing*: pattern A shadows pattern B if every
> string matched by B is also matched by A (that is, L(B) ⊆ L(A), meaning every string in B's
> language is also in A's language), and A appears before B in the lexer definition. If this
> occurs, B will never match — it is dead code.
>
> Alpaca's `RegexChecker` uses the `dregex` library (a Scala/JVM library for decidable regex
> operations) to check at compile time whether any pattern's language is a subset of an earlier
> pattern's language. If shadowing is detected, the macro throws a `ShadowException` with a
> compile error pointing to the offending patterns.
>
> **Example:** If you wrote the integer pattern `"[0-9]+"` before the decimal pattern
> `"[0-9]+(\\.[0-9]+)?"`, the integer pattern would shadow the decimal one — every decimal like
> `"3.14"` is also matched by `"[0-9]+"` up to the decimal point, but more critically the integer
> pattern can match the prefix `"3"` and would consume it first. The `dregex` check catches this

The shadowing definition here uses plain language subset L(B) ⊆ L(A), but Alpaca’s implementation checks subset on pattern + ".*" (prefix languages) because matching uses lookingAt and doesn’t require a full-string match. Consider updating the definition to reflect this prefix-based notion so the described behavior matches the actual checker.

Suggested change:

```diff
-A practical issue with ordered alternation is *shadowing*: pattern A shadows pattern B if every
-string matched by B is also matched by A (that is, L(B) ⊆ L(A), meaning every string in B's
-language is also in A's language), and A appears before B in the lexer definition. If this
-occurs, B will never match — it is dead code.
+A practical issue with ordered alternation is *shadowing*. Intuitively, pattern A shadows pattern
+B if, whenever B could match starting at some input position, A can also match some (possibly
+shorter) prefix there, and A appears before B in the lexer definition. In that situation B will
+never be the pattern that the lexer chooses — it is effectively dead code.

-Alpaca's `RegexChecker` uses the `dregex` library (a Scala/JVM library for decidable regex
-operations) to check at compile time whether any pattern's language is a subset of an earlier
-pattern's language. If shadowing is detected, the macro throws a `ShadowException` with a
-compile error pointing to the offending patterns.
+Formally, because Alpaca's lexer uses `matcher.lookingAt()`, matching is prefix-based rather than
+whole-string based. Alpaca's `RegexChecker` therefore uses the `dregex` library (a Scala/JVM
+library for decidable regex operations) to check at compile time whether the *prefix-extended*
+language of a later pattern (conceptually its language with `".*"` appended) is a subset of the
+prefix-extended language of an earlier pattern. If such a subset relation holds, the macro throws
+a `ShadowException` with a compile error pointing to the offending patterns.

 **Example:** If you wrote the integer pattern `"[0-9]+"` before the decimal pattern
 `"[0-9]+(\\.[0-9]+)?"`, the integer pattern would shadow the decimal one — every decimal like
-`"3.14"` is also matched by `"[0-9]+"` up to the decimal point, but more critically the integer
-pattern can match the prefix `"3"` and would consume it first. The `dregex` check catches this
+`"3.14"` is also matched by `"[0-9]+"` on its initial digits, and the integer pattern can match
+the prefix `"3"` and would consume it first. The prefix-based `dregex` check catches this
```
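The prefix-based matching behavior this comment relies on can be demonstrated directly with `java.util.regex`, independently of Alpaca. This standalone sketch uses the exact integer/decimal patterns from the example: `lookingAt()` anchors at the start but does not require consuming the whole input, so the integer pattern happily eats the prefix `"3"` of `"3.14"`.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ShadowingDemo {
    public static void main(String[] args) {
        // Integer pattern listed first, decimal pattern second: the shadowing order.
        Pattern integer = Pattern.compile("[0-9]+");
        Pattern decimal = Pattern.compile("[0-9]+(\\.[0-9]+)?");

        String input = "3.14";

        // lookingAt() matches a prefix, not the whole string, so the integer
        // pattern wins first and consumes only "3".
        Matcher i = integer.matcher(input);
        if (i.lookingAt()) {
            System.out.println(i.group()); // prints "3"
        }

        // The decimal pattern would have matched the full "3.14" at the same
        // position, but an ordered lexer never gives it the chance.
        Matcher d = decimal.matcher(input);
        if (d.lookingAt()) {
            System.out.println(d.group()); // prints "3.14"
        }
    }
}
```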
> Alpaca follows the same principle but implements it using Java's regex engine, which is itself
> backed by NFA/DFA machinery:

Java’s java.util.regex.Pattern is a backtracking regex engine; it’s not a DFA execution model and can have super-linear worst-case behavior for certain patterns. Consider rephrasing this paragraph to avoid implying DFA semantics/guarantees from using Java regexes.
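The backtracking behavior described in this comment is observable empirically. This standalone sketch (not part of Alpaca) wraps the input in a `CharSequence` that counts character reads: a true DFA would read each character roughly once per pass, while Java's engine re-reads positions many times on a classic catastrophic pattern.

```java
import java.util.regex.Pattern;

public class BacktrackDemo {
    // CharSequence wrapper that counts how often the regex engine reads a char.
    static final class CountingSeq implements CharSequence {
        final String s;
        long reads = 0;
        CountingSeq(String s) { this.s = s; }
        public int length() { return s.length(); }
        public char charAt(int i) { reads++; return s.charAt(i); }
        public CharSequence subSequence(int a, int b) { return s.subSequence(a, b); }
    }

    public static void main(String[] args) {
        // Classic catastrophic-backtracking shape: nested quantifiers, no match.
        CountingSeq seq = new CountingSeq("aaaaaaaaaaaaaaaaaa" + "c"); // 18 'a's, then 'c'
        boolean matched = Pattern.compile("(a+)+b").matcher(seq).matches();
        System.out.println(matched); // false: the required 'b' never appears
        // A DFA would need ~19 reads; the backtracking engine revisits positions
        // exponentially many times while trying every way to split the 'a' run.
        System.out.println(seq.reads > 1000); // true
    }
}
```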
> that appear in source text. In a lexer, each token class acts as a terminal: it names a category
> of strings, and no lexer-level expansion applies below it.
>
> See [Context-Free Grammars](theory/cfg.md) for how terminals fit into production rules.

The link target theory/cfg.md is both an incorrect relative path from this page (it would resolve to theory/theory/cfg.md) and there is no cfg.md page under docs/_docs/theory/. Add the missing CFG page or update/remove this link so it resolves correctly.

Suggested change:

```diff
-See [Context-Free Grammars](theory/cfg.md) for how terminals fit into production rules.
+See the discussion of context-free grammars for how terminals fit into production rules.
```
> The consequence: if your regex is invalid, or your grammar is ambiguous, you get a compile error — not a runtime crash. The pipeline is safe by construction before it ever runs on real input.
>
> Alpaca covers stages 1–3 of the classical pipeline. The "code generation" stage is not part of the library — your Scala semantic actions in the parser rules produce the final typed value directly.

This statement contradicts earlier text on the page that CalcParser.parse “handles stages 3–4” and that Alpaca “stops at stage 4”. If semantic analysis/evaluation happens via parser semantic actions, Alpaca effectively covers stage 4 as well; please reconcile the stage numbering here for consistency.

Suggested change:

```diff
-Alpaca covers stages 1–3 of the classical pipeline. The "code generation" stage is not part of the library — your Scala semantic actions in the parser rules produce the final typed value directly.
+Alpaca covers stages 1–4 of the classical pipeline. The "code generation" stage is not part of the library — your Scala semantic actions in the parser rules produce the final typed value directly.
```
> - See [Lexer](../lexer.md) for the complete `lexer` DSL reference.
> - See [Tokens and Lexemes](tokens.md) for what the lexer produces — the lexeme stream.
> - Next: [Context-Free Grammars](theory/cfg.md) for how token streams are parsed.

The link theory/cfg.md is broken from this page (it would resolve to theory/theory/cfg.md), and there is no CFG page under docs/_docs/theory/. Add the missing page or update this cross-link to an existing target.

Suggested change:

```diff
-- Next: [Context-Free Grammars](theory/cfg.md) for how token streams are parsed.
+- Next: [Context-Free Grammars](cfg.md) for how token streams are parsed.
```
> This means Alpaca's lexer runs with the same O(n) guarantee as a hand-built DFA: one pass
> through the input, no backtracking.

Saying this “means … O(n) … no backtracking” is not accurate for Java regex in general. It would be safer to describe the implementation (combined pattern + lookingAt) without promising DFA-like time guarantees.

Suggested change:

```diff
-This means Alpaca's lexer runs with the same O(n) guarantee as a hand-built DFA: one pass
-through the input, no backtracking.
+In practice, this combined-pattern approach lets Alpaca's lexer scan the input from left to
+right in a single pass, much like a hand-built DFA-based lexer. However, it still relies on
+Java's backtracking regex engine internally, so Alpaca does not claim a strict worst-case O(n)
+time guarantee or the complete absence of backtracking for arbitrary token patterns.
```
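The combined-pattern-plus-`lookingAt` scanning strategy discussed here can be sketched as a tiny standalone tokenizer in plain `java.util.regex`. The token names and patterns below are illustrative, not Alpaca's API: one alternation with a named group per token class, advanced left to right with `region` + `lookingAt` in a single pass.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CombinedPatternLexer {
    // One combined alternation with a named group per token class. On ties the
    // first alternative wins, which is why pattern order (and shadow checking)
    // matters in an ordered lexer.
    static final Pattern TOKENS = Pattern.compile(
        "(?<NUMBER>[0-9]+(\\.[0-9]+)?)|(?<PLUS>\\+)|(?<LPAREN>\\()|(?<RPAREN>\\))|(?<WS>\\s+)");

    static final String[] NAMES = {"NUMBER", "PLUS", "LPAREN", "RPAREN", "WS"};

    public static void main(String[] args) {
        String input = "(1.5 + 2)";
        List<String> lexemes = new ArrayList<>();
        Matcher m = TOKENS.matcher(input);
        int pos = 0;
        while (pos < input.length()) {
            m.region(pos, input.length());       // anchor the next match at pos
            if (!m.lookingAt()) {
                throw new IllegalStateException("no token at offset " + pos);
            }
            for (String name : NAMES) {          // find which alternative matched
                if (m.group(name) != null) {
                    if (!name.equals("WS")) {    // skip whitespace tokens
                        lexemes.add(name + "(" + m.group(name) + ")");
                    }
                    break;
                }
            }
            pos = m.end();                        // advance past the match: one pass
        }
        System.out.println(lexemes);
    }
}
```

Note the single forward sweep: `pos` only ever increases, even though the engine underneath may still backtrack while deciding each individual match.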
Summary
- Formal definition `parse ∘ tokenize : String → R`; lexeme triple `(T, w, pos)`; canonical CalcLexer definition with all 7 tokens (NUMBER/PLUS/MINUS/TIMES/DIVIDE/LPAREN/RPAREN)
- The `lexer` macro compiles regex at compile time; shadow detection via dregex
- All three pages include formal definition blocks, compile-time processing callouts, and cross-links to the corresponding Alpaca reference docs (lexer.md, parser.md).
Part of the v1.1 Compiler Theory Tutorial milestone — Phase 8: Theory Foundation (TH-01, TH-02, TH-03).
Test plan
- `./mill docJar` passes (all examples compile — macro blocks use `sc:nocompile`)
- `theory/pipeline.md` contains the `> **Compile-time processing:**` callout and formal definition
- `theory/tokens.md` contains the lexeme triple definition and full CalcLexer 7-token table
- `theory/lexer-fa.md` contains the DFA 5-tuple definition and NFA/DFA conceptual section
- No LaTeX (no `$`), no `extends Parser` grammar leakage, no `sc:compile` on macro blocks

🤖 Generated with Claude Code