# Context-Free Grammars
Context-free grammars are the backbone of syntactic analysis. A grammar defines a language by specifying how symbols can be combined and rewritten — "context-free" means each rule applies regardless of surrounding context. If the lexer is the vocabulary of a language, the grammar is its syntax.

## What is a Context-Free Grammar?

A grammar consists of a set of non-terminal symbols (grammar variables that can be expanded), a set of terminal symbols (the tokens the lexer produces), a set of production rules (rewrite rules), and a start symbol. A derivation starts from the start symbol and repeatedly replaces non-terminals with production right-hand sides until only terminals remain. The language of a grammar G is the set of all terminal strings reachable from the start symbol.

> **Definition — Context-Free Grammar:**
> A CFG is a 4-tuple G = (V, Σ, R, S) where:
> - V is a finite set of non-terminal symbols (grammar variables)
> - Σ is a finite set of terminal symbols (tokens), V ∩ Σ = ∅
> - R ⊆ V × (V ∪ Σ)* is a finite set of production rules
> - S ∈ V is the start symbol
>
> A production rule A → α means the non-terminal A can be replaced by the symbol string α.
> A grammar generates the language L(G) = { w ∈ Σ* | S ⇒* w } — all terminal strings
> derivable from S in zero or more steps.

## BNF Notation

Production rules are written in Backus-Naur Form (BNF): `A → α` means A can be rewritten as α. The vertical bar `|` separates alternatives, so `A → α | β` is shorthand for two rules. Non-terminals are written in CamelCase; terminals are UPPERCASE (matching Alpaca's token name conventions).

EBNF (Extended BNF) adds optional elements `[...]`, repetition `{...}`, and grouping `(...)`. These shorthands can always be translated into plain BNF, but are useful for compact notation. This page uses BNF throughout for clarity; Alpaca's DSL maps directly to BNF productions.
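
For example, a hypothetical EBNF rule `List → Item { COMMA Item }` unfolds into plain BNF with a helper non-terminal and recursion (ε denotes the empty string):

```
List → Item Rest
Rest → COMMA Item Rest
     | ε
```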

## The Calculator Grammar

The calculator grammar is the running example for the entire Compiler Theory Tutorial. It defines arithmetic expressions with four operators and parentheses:

```
Expr → Expr PLUS Expr
| Expr MINUS Expr
| Expr TIMES Expr
| Expr DIVIDE Expr
| LPAREN Expr RPAREN
| NUMBER

root → Expr
```

Identifying the 4-tuple components:

- V = {Expr, root} — two non-terminals
- Σ = {NUMBER, PLUS, MINUS, TIMES, DIVIDE, LPAREN, RPAREN} — seven terminal symbols, produced by CalcLexer
- R = the 7 production rules above
- S = root — the start symbol
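
As a worked illustration, the calculator grammar's 4-tuple can be written down as plain data. The types below are hypothetical helpers for this page, not part of Alpaca's API:

```scala
// The 4-tuple G = (V, Σ, R, S) modeled as plain data (hypothetical, not Alpaca's API)
enum Sym:
  case NT(name: String)  // non-terminal, an element of V
  case T(name: String)   // terminal, an element of Σ

final case class Production(lhs: Sym.NT, rhs: List[Sym])

final case class Cfg(
  v: Set[Sym.NT],      // V  — non-terminals
  sigma: Set[Sym.T],   // Σ  — terminals
  r: List[Production], // R  — production rules
  s: Sym.NT,           // S  — start symbol
)

import Sym.*
val expr = NT("Expr")
val root = NT("root")
val calc = Cfg(
  v = Set(expr, root),
  sigma = Set("NUMBER", "PLUS", "MINUS", "TIMES", "DIVIDE", "LPAREN", "RPAREN").map(T.apply),
  r = List(
    Production(expr, List(expr, T("PLUS"), expr)),
    Production(expr, List(expr, T("MINUS"), expr)),
    Production(expr, List(expr, T("TIMES"), expr)),
    Production(expr, List(expr, T("DIVIDE"), expr)),
    Production(expr, List(T("LPAREN"), expr, T("RPAREN"))),
    Production(expr, List(T("NUMBER"))),
    Production(root, List(expr)),
  ),
  s = root,
)
```

Counting the pieces recovers the summary above: two non-terminals, seven terminals, seven rules.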

Note: this grammar is **ambiguous** — the expression `1 + 2 * 3` can be parsed in two ways depending on which `Expr` is expanded first. We will see how Alpaca resolves ambiguities on the [Conflict Resolution](../conflict-resolution.md) page.
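
Concretely, the two parses group the operands differently and would evaluate to different results:

```
(1 + 2) * 3 = 9.0    root production: Expr → Expr TIMES Expr
1 + (2 * 3) = 7.0    root production: Expr → Expr PLUS Expr
```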

## Derivation

A *derivation* is a sequence of rewriting steps from the start symbol to a terminal string. Each step replaces the leftmost non-terminal with one of its production alternatives (leftmost derivation).

Leftmost derivation for `1 + 2`:

```
root ⇒ Expr
⇒ Expr PLUS Expr (apply: Expr → Expr PLUS Expr)
⇒ NUMBER PLUS Expr (apply: Expr → NUMBER, leftmost)
⇒ NUMBER PLUS NUMBER (apply: Expr → NUMBER, leftmost)
```

The first step applies `root → Expr`; the second expands the leftmost `Expr` using the `Expr PLUS Expr` production; the third and fourth substitute the literal `NUMBER` terminal for each remaining `Expr`.
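
The rewriting steps above can be mimicked in a few lines of self-contained Scala (illustrative only, not Alpaca machinery): a sentential form is a list of symbol names, and each step replaces the leftmost occurrence of a non-terminal with a production right-hand side.

```scala
// A sentential form mixes non-terminals ("Expr") and terminals ("NUMBER", "PLUS").
type Sentential = List[String]

// Replace the leftmost occurrence of non-terminal `nt` with `rhs`.
def rewriteLeftmost(form: Sentential, nt: String, rhs: List[String]): Sentential =
  form.indexOf(nt) match
    case -1 => form
    case i  => form.take(i) ::: rhs ::: form.drop(i + 1)

// The leftmost derivation of "1 + 2", step by step:
val s1 = rewriteLeftmost(List("root"), "root", List("Expr"))             // Expr
val s2 = rewriteLeftmost(s1, "Expr", List("Expr", "PLUS", "Expr"))       // Expr PLUS Expr
val s3 = rewriteLeftmost(s2, "Expr", List("NUMBER"))                     // NUMBER PLUS Expr
val s4 = rewriteLeftmost(s3, "Expr", List("NUMBER"))                     // NUMBER PLUS NUMBER
```

The final form contains only terminals, so the derivation is complete.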

## Parse Trees

A parse tree captures the grammatical structure of a derivation as a tree. Each internal node is a non-terminal; each leaf is a terminal. The parse tree for `1 + 2`:

```
root
|
Expr
/ | \
Expr PLUS Expr
| |
NUMBER NUMBER
(1.0) (2.0)
```

Note: In Alpaca, parse trees are a conceptual model only; the runtime LR parser does not construct or retain an explicit parse-tree object. During shift-reduce parsing it immediately evaluates your semantic actions (the `=>` expressions in `rule` definitions) as each production is reduced, and the `Parser` macro's job is to analyze those rules at compile time and generate the LR parse tables. What `parse()` returns is the typed result — a `Double` in the calculator case — not an intermediate tree structure. (See [The Compilation Pipeline](pipeline.md) for the full picture.)

## Alpaca DSL Mapping

The calculator grammar maps directly to an Alpaca `Parser` definition. Each production rule becomes a case clause in a `rule(...)` call: the pattern on the left of `=>` matches the grammatical structure, and the expression on the right computes the result.

```scala sc:nocompile
import alpaca.*

object CalcParser extends Parser:
val Expr: Rule[Double] = rule(
{ case (Expr(a), CalcLexer.PLUS(_), Expr(b)) => a + b },
{ case (Expr(a), CalcLexer.MINUS(_), Expr(b)) => a - b },
{ case (Expr(a), CalcLexer.TIMES(_), Expr(b)) => a * b },
{ case (Expr(a), CalcLexer.DIVIDE(_), Expr(b)) => a / b },
{ case (CalcLexer.LPAREN(_), Expr(e), CalcLexer.RPAREN(_)) => e },
{ case CalcLexer.NUMBER(n) => n.value },
)
val root: Rule[Double] = rule:
case Expr(v) => v
```

Each `case` clause corresponds to one production rule. `Expr(a)` matches a reduced `Expr` non-terminal with value `a`. `CalcLexer.PLUS(_)` matches the PLUS terminal (the `_` discards the lexeme value since PLUS carries `Unit`). `CalcLexer.NUMBER(n)` matches a NUMBER terminal; `n.value` accesses the `Double` extracted by the lexer. The grammar's non-terminals (`Expr`, `root`) become `Rule[Double]` values; the type parameter is the result type of each reduction.

> **Compile-time processing:** When you define `object CalcParser extends Parser`, the Alpaca macro reads every `rule` declaration and constructs the LR(1) parse table at compile time.

## Cross-links

- See [Tokens and Lexemes](tokens.md) for how the terminal symbols (NUMBER, PLUS, etc.) are produced by the lexer.
- Next: [Why LR?](why-lr.md) — why LR parsing was chosen over top-down alternatives.
- See [Parser](../parser.md) for the complete `rule` DSL reference and all extractor forms.
- See [Conflict Resolution](../conflict-resolution.md) for how Alpaca resolves ambiguity in the calculator grammar.
# The Lexer: Regex to Finite Automata

## What Does a Lexer Do?

A lexer reads a character stream from left to right and emits a token stream. At each position,
it tries the token class patterns in their specified order and picks the first one whose regex
matches a prefix of the remaining input — patterns are tried in order; first match wins. When no
pattern matches the current position, the lexer throws an error. The result is a flat list of
lexemes that the parser consumes next.

## Regular Languages

> **Definition — Regular language:**
> A language L ⊆ Σ* is *regular* if it is recognized by a finite automaton (FA). Equivalently,
> L can be described by a regular expression over alphabet Σ.
> Each token class defines a regular language: `NUMBER` defines the set
> { "0", "1", ..., "3.14", "100", ... }.

Regex notation is a concise way to specify regular languages, and that is why regex is the right
tool for token class definitions: token shapes can be recognized with a fixed, finite amount of
memory, which is exactly the class of languages finite automata capture. More complex patterns
such as balanced parentheses require a more powerful formalism (context-free grammars, which the
parser handles), but for token recognition, regular expressions are expressive enough while
staying simple and fast.

## NFA and DFA: The Conceptual Picture

Any regular expression can be translated into a finite automaton that accepts the same strings.
The standard construction proceeds in two steps.

**Step 1 — NFA (nondeterministic finite automaton).** A regex is converted into an NFA via
Thompson's construction. An NFA can have multiple possible transitions from a state on the same
input, or transitions on the empty string. For simple patterns this is easy to visualize. The
`PLUS` token pattern `\+` produces a two-state NFA:

| State | Input `+` | Accept? |
|-------|-----------|---------|
| q₀ | q₁ | No |
| q₁ | — | Yes |

The machine starts at q₀, consumes a `+`, and moves to q₁ — an accepting state. Any other
input from q₀ leads nowhere, meaning the string does not match.

**Step 2 — DFA (deterministic finite automaton).** An NFA is then converted to a DFA. A DFA
has exactly one transition per (state, input-character) pair, with no ambiguity. This matters
for performance: a DFA can be executed in O(n) time by reading the input left to right, one
character at a time, following the single applicable transition at each step. A DFA is therefore
the right runtime data structure for a lexer — no backtracking, no branching.

> **Definition — Deterministic Finite Automaton (DFA):**
> A DFA is a 5-tuple (Q, Σ, δ, q₀, F) where:
> - Q is a finite set of states
> - Σ is the input alphabet (here: Unicode characters)
> - δ : Q × Σ → Q is the transition function
> - q₀ ∈ Q is the start state
> - F ⊆ Q is the set of accepting states
>
> A DFA accepts a string w if δ*(q₀, w) ∈ F, where δ* is the iterated transition function.
> In Alpaca's combined lexer DFA, each accepting state also carries a *token label* indicating
> which token class was matched.
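
The iterated transition function δ* is just a fold over the input string. A minimal, self-contained sketch (not Alpaca's internals), using the two-state PLUS automaton from the table above:

```scala
// A DFA as a transition map; a missing entry means "no transition" (reject).
final case class Dfa(
  delta: Map[(Int, Char), Int], // δ : Q × Σ → Q (partial)
  start: Int,                   // q₀
  accepting: Set[Int],          // F
)

// δ*(q₀, w) ∈ F, computed by folding the transition function over the characters of w.
def accepts(dfa: Dfa, w: String): Boolean =
  w.foldLeft(Option(dfa.start)) { (state, c) =>
    state.flatMap(q => dfa.delta.get((q, c)))
  }.exists(dfa.accepting)

// The PLUS automaton: q₀ --'+'--> q₁, with q₁ accepting.
val plusDfa = Dfa(Map((0, '+') -> 1), start = 0, accepting = Set(1))
```

Running it: `accepts(plusDfa, "+")` holds, while the empty string and `"++"` are rejected, matching the table.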

## Combining Token Patterns into One Automaton

To lex a language with multiple token classes, the standard approach builds one combined DFA. In
theory: construct an NFA for each token pattern, connect them all to a new start state with
epsilon transitions, then convert the combined NFA to a single DFA.

Alpaca follows the same principle but implements it using Java's regex engine (a backtracking
matcher built on the same automata theory, rather than a precompiled DFA):

- All token patterns are combined into a single Java regex alternation at compile time:

```
// Conceptual: how Alpaca combines patterns internally
(?<NUMBER>[0-9]+(\.[0-9]+)?)|(?<PLUS>\+)|(?<MINUS>-)|(?<TIMES>\*)|...
```

- `java.util.regex.Pattern.compile(...)` is called inside the `lexerImpl` macro at compile
time. An invalid regex pattern therefore causes a compile error, not a runtime crash.
- At runtime, `Tokenization.tokenize()` uses `matcher.lookingAt()` on the combined pattern at
the current input position. It then checks which named group matched using
`matcher.start(i)` to determine the token class.
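
The same lookingAt-plus-named-groups scheme can be re-created in a few lines of plain Scala with `java.util.regex` (an illustrative re-implementation, not Alpaca's actual code):

```scala
import java.util.regex.Pattern

// A shrunken combined alternation with named groups, tried in order.
val tokenNames = List("NUMBER", "PLUS", "MINUS", "TIMES", "DIVIDE")
val combined = Pattern.compile(
  "(?<NUMBER>[0-9]+(\\.[0-9]+)?)|(?<PLUS>\\+)|(?<MINUS>-)|(?<TIMES>\\*)|(?<DIVIDE>/)"
)

def scan(input: String): List[(String, String)] =
  val m   = combined.matcher(input)
  val out = List.newBuilder[(String, String)]
  var pos = 0
  while pos < input.length do
    m.region(pos, input.length)                 // anchor matching at the current offset
    if !m.lookingAt() then
      throw IllegalArgumentException(s"no token at offset $pos")
    // start(name) is -1 for groups that did not participate in the match.
    val name = tokenNames.find(n => m.start(n) >= 0).get
    out += ((name, m.group(name)))
    pos = m.end()
  out.result()
```

For example, `scan("3.14+2*5")` yields NUMBER, PLUS, NUMBER, TIMES, NUMBER with the matched text attached to each.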

In practice, this means Alpaca's lexer uses a single pre-compiled combined regex and scans
through the input from left to right, matching at each position with `lookingAt()` and using
the named capturing groups to determine the token class; the exact performance and any
backtracking behavior are determined by the Java regex engine and the specific token patterns.

## Shadowing Detection

A practical issue with ordered alternation is *shadowing*: pattern A shadows pattern B if every
string matched by B is also matched by A (that is, L(B) ⊆ L(A), meaning every string in B's
language is also in A's language), and A appears before B in the lexer definition. If this
occurs, B will never match — it is dead code.

Alpaca's `RegexChecker` uses the `dregex` library (a Scala/JVM library for decidable regex
operations) to check at compile time whether any pattern's language is a subset of an earlier
pattern's language. If shadowing is detected, the macro throws a `ShadowException` with a
compile error pointing to the offending patterns.

**Example:** If you wrote the integer pattern `"[0-9]+"` before the decimal pattern
`"[0-9]+(\\.[0-9]+)?"`, the integer pattern would shadow the decimal one — every decimal like
`"3.14"` is also matched by `"[0-9]+"` up to the decimal point, but more critically the integer
pattern can match the prefix `"3"` and would consume it first. The `dregex` check catches this
ordering mistake at compile time rather than silently producing wrong output at runtime.
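
The runtime consequence of the wrong ordering is easy to reproduce with a plain ordered alternation in `java.util.regex` (illustrative only; this is not how Alpaca's checker works):

```scala
import java.util.regex.Pattern

// Wrong order: the integer branch comes first, so it wins on the prefix "3"
// and the DECIMAL branch is never consulted for "3.14".
val wrongOrder = Pattern.compile("(?<INT>[0-9]+)|(?<DECIMAL>[0-9]+(\\.[0-9]+)?)")
val m = wrongOrder.matcher("3.14")
val matched = if m.lookingAt() then m.group() else ""   // only "3" is consumed
val decimalFired = m.start("DECIMAL") >= 0              // false: branch never matched
```

A scanner built on this pattern would emit `NUMBER("3")` and then choke on the leftover `".14"`.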

In `CalcLexer`, the decimal pattern `"[0-9]+(\\.[0-9]+)?"` is listed first, before any simpler
integer-only pattern, so no shadowing occurs.

> **Compile-time processing:** The `lexer` macro validates all regex patterns, combines them into a single alternation pattern, and checks for shadowing using `dregex` — all at compile time. If a regex is invalid or one pattern shadows another, you get a compile error. At runtime, the generated `Tokenization` object runs the pre-compiled combined regex against your input string.

## Cross-links

- See [Lexer](../lexer.md) for the complete `lexer` DSL reference.
- See [Tokens and Lexemes](tokens.md) for what the lexer produces — the lexeme stream.
- Next: [Context-Free Grammars](cfg.md) for how token streams are parsed.
# The Compilation Pipeline

Source text is just a string. A compiler pipeline is a sequence of transformations that turns that string into something structured and meaningful. Each stage takes the output of the previous one, narrowing the representation from raw text to a typed result.

Understanding the pipeline gives you a mental model that applies to every Alpaca program you write — not just calculator expressions, but any language you define with the library.

## The Four Stages

Most compilers share the same four-stage structure:

1. **Source text** — the raw input string, e.g., `"3 + 4 * 2"`
2. **Lexical analysis** — groups characters into tokens: `NUMBER(3.0)`, `PLUS`, `NUMBER(4.0)`, `TIMES`, `NUMBER(2.0)`
3. **Syntactic analysis** — recognizes the grammatical structure of the token stream (conceptually, a parse tree or concrete syntax tree)
4. **Semantic analysis / evaluation** — extracts meaning from the tree, producing a typed result (in a calculator: `Double`)

Some compilers add a fifth stage — code generation — that emits machine code or bytecode. Alpaca stops at stage 4: its pipeline produces a typed Scala value, not machine code.

## Alpaca's Pipeline

With Alpaca, running the full pipeline takes two calls:

```scala sc:nocompile
// Full pipeline: source text → typed result
val (_, lexemes) = CalcLexer.tokenize("3 + 4 * 2")
// lexemes: List[Lexeme] — NUMBER(3.0), PLUS, NUMBER(4.0), TIMES, NUMBER(2.0)

val (_, result) = CalcParser.parse(lexemes)
// result: Double | Null = 11.0
```

`CalcLexer.tokenize` handles stages 1–2: it takes the source string and produces a `List[Lexeme]`. `CalcParser.parse` handles stages 3–4: it consumes those lexemes using the generated LR(1) parse table and your semantic actions to compute the typed result, without constructing an explicit parse tree data structure.

Both `CalcLexer` and `CalcParser` are objects generated by Alpaca's macros. Their definitions live in separate files (see the cross-links at the bottom of this page).

## Compile-time vs Runtime Boundary

Alpaca draws a sharp line between what happens at compile time and what happens at runtime. This is the most important thing to understand about the library.

> **Compile-time processing:** When you write a `lexer` definition, the Scala 3 macro validates your regex patterns, checks for shadowing, and generates the `Tokenization` object. When you write a `Parser` definition, the macro reads your grammar, builds the LR(1) parse table, and detects any shift/reduce conflicts — all at compile time. At runtime, `tokenize(input)` and `parse(lexemes)` execute the pre-generated code.

In concrete terms:

**Compile time:**
- The `lexer` macro validates regex patterns, detects shadowing (where one pattern makes another unreachable), and emits a `Tokenization` object
- The `Parser` macro reads every `Rule` declaration, constructs the LR(1) parse table, and reports any shift/reduce or reduce/reduce conflicts as compile errors

**Runtime:**
- `tokenize(input)` executes the pre-generated code and returns `List[Lexeme]`
- `parse(lexemes)` executes the pre-built parse table and returns the typed result

The consequence: if your regex is invalid, or your grammar is ambiguous, you get a compile error — not a runtime crash. The pipeline is safe by construction before it ever runs on real input.

Alpaca covers stages 1–4 of the classical pipeline. The fifth "code generation" stage is not part of the library — your Scala semantic actions in the parser rules produce the final typed value directly.

## Formal Definition

> **Definition — Compilation pipeline:**
> A compiler pipeline is a composition of transformations fₙ ∘ ... ∘ f₂ ∘ f₁ where each fᵢ consumes the output of fᵢ₋₁ and produces a more structured representation.
> Alpaca's pipeline: `parse ∘ tokenize : String → R` where R is the root non-terminal's result type.

For the calculator example, `R` is `Double`. For a JSON parser, `R` might be `Any` or a custom AST type. The pipeline shape is always the same; only the result type changes.
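
The composition shape can be sketched in a few lines. The stages below are stand-ins (whitespace splitting and summation, not Alpaca's real signatures), shown only to make the `parse ∘ tokenize` shape concrete:

```scala
// Stand-in stages: a "lexer" that splits on whitespace and a "parser" that sums numbers.
val tokenize: String => List[String]   = _.trim.split("\\s+").toList
val evaluate: List[String] => Double   = _.flatMap(_.toDoubleOption).sum

// The pipeline is just function composition; its type is String => R with R = Double.
val pipeline: String => Double = tokenize.andThen(evaluate)
```

Swapping in a different `evaluate` changes `R` without changing the pipeline shape, which is exactly the point of the definition above.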

The parser internally appends a special `Lexeme.EOF` marker to the lexeme list before running the shift/reduce loop. This is an implementation detail — you do not need to add it yourself.

## Mapping the Stages to Alpaca Types

Each pipeline stage corresponds to a concrete Alpaca type:

| Stage | Input | Output | Alpaca Type |
|-------|-------|--------|-------------|
| Source text | — | `String` | `String` (plain Scala) |
| Lexical analysis | `String` | token stream | `List[Lexeme]` |
| Syntactic analysis | `List[Lexeme]` | reductions (via LR(1) stack) | LR(1) stack + reductions (internal) |
| Semantic analysis | reduction values | typed result | `R \| Null` (your root type) |

Alpaca never constructs or returns an explicit parse tree object. Instead, it maintains an LR(1) stack and applies your semantic actions (the `=>` expressions in `rule` definitions) at each reduction, so what you get back from `parse` is the final typed value, not an intermediate tree.

## What Comes Next

The rest of the Compiler Theory Tutorial builds on this mental model:

- Next: [Tokens & Lexemes](tokens.md) — what the lexer produces: token classes, token instances, and how they are represented in Alpaca
- [The Lexer: Regex to Finite Automata](lexer-fa.md) — how regular expressions define token classes and how Alpaca compiles them

For the full API, see the reference pages:

- See [Lexer](lexer.md) for how `CalcLexer` is defined.
- See [Parser](parser.md) for how `CalcParser` is defined and how grammar rules produce a typed result.