112 changes: 112 additions & 0 deletions docs/_docs/theory/lexer-fa.md
@@ -0,0 +1,112 @@
# The Lexer: Regex to Finite Automata

## What Does a Lexer Do?

A lexer reads a character stream from left to right and emits a token stream. At each scan step,
it tries the token class patterns in a fixed order and picks the first pattern whose regex
matches at the current position, consuming that matched prefix. When no pattern matches the
current position, the lexer throws an error. The result is a flat list of lexemes that the
parser consumes next.
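
A scanner loop of this shape can be sketched in a few lines. This is a hypothetical toy, not Alpaca's implementation: the pattern list, the `"WS"` stand-in for an ignored class, and the `scan` helper are all illustrative. Patterns are tried in order, and the first one that matches at the current position wins:

```scala
import java.util.regex.Pattern

// Hypothetical sketch of ordered first-match scanning (not Alpaca's code).
val patterns = List(
  "NUMBER" -> Pattern.compile("[0-9]+(\\.[0-9]+)?"),
  "PLUS"   -> Pattern.compile("\\+"),
  "WS"     -> Pattern.compile("\\s+") // stands in for an ignored token class
)

def scan(input: String): List[(String, String)] =
  var pos = 0
  val out = List.newBuilder[(String, String)]
  while pos < input.length do
    // Try each pattern in declaration order at the current position.
    val hit = patterns.iterator.flatMap { (name, p) =>
      val m = p.matcher(input).region(pos, input.length)
      if m.lookingAt() then Some(name -> m.group()) else None
    }.nextOption()
    hit match
      case Some((name, text)) =>
        if name != "WS" then out += (name -> text) // ignored classes emit nothing
        pos += text.length                         // consume the matched prefix
      case None => sys.error(s"no pattern matches at position $pos")
  out.result()
```

Running `scan("3 + 4.5")` yields the `NUMBER`, `PLUS`, `NUMBER` sequence with the whitespace dropped.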

## Regular Languages

> **Definition — Regular language:**
> A language L ⊆ Σ* is *regular* if it is recognized by a finite automaton (FA). Equivalently,
> L can be described by a regular expression over alphabet Σ.
> Each token class defines a regular language: `NUMBER` defines the set
> { "0", "1", ..., "3.14", "100", ... }.

Regex notation is a concise way to specify regular languages. This is why regex is the right
tool for token class definitions — token recognition needs only finite-state matching, which
regular languages capture exactly. More complex patterns such as balanced parentheses require
a more powerful formalism (context-free grammars, which the parser handles), but for token
recognition, regular expressions are sufficient.
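
To see why balanced parentheses fall outside the regular languages: recognizing them needs an unbounded counter, and a finite automaton has only finitely many states. A counter-based checker (a toy sketch, not part of Alpaca) does the job easily:

```scala
// Sketch: balanced parentheses need an unbounded nesting counter,
// which no finite automaton (and hence no regex) can maintain exactly.
def balanced(s: String): Boolean =
  var depth = 0
  for c <- s do
    if c == '(' then depth += 1
    else if c == ')' then depth -= 1
    if depth < 0 then return false // a ')' closed more than was opened
  depth == 0
```

The single `Int` here can grow without bound — that unbounded state is precisely what pushes the language into context-free territory.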

## NFA and DFA: The Conceptual Picture

Any regular expression can be translated into a finite automaton that accepts the same strings.
The standard construction proceeds in two steps.

**Step 1 — NFA (nondeterministic finite automaton).** A regex is converted into an NFA via
Thompson's construction. An NFA can have multiple possible transitions from a state on the same
input, or transitions on the empty string. For simple patterns this is easy to visualize. The
`PLUS` token pattern `\+` produces a two-state NFA:

| State | Input `+` | Accept? |
|-------|-----------|---------|
| q₀ | q₁ | No |
| q₁ | — | Yes |

The machine starts at q₀, consumes a `+`, and moves to q₁ — an accepting state. Any other
input from q₀ leads nowhere, meaning the string does not match.

**Step 2 — DFA (deterministic finite automaton).** An NFA is then converted to a DFA. A DFA
has exactly one transition per (state, input-character) pair, with no ambiguity. This matters
for performance: a DFA can be executed in O(n) time by reading the input left to right, one
character at a time, following the single applicable transition at each step. A DFA is therefore
the right runtime data structure for a lexer — no backtracking, no branching.

> **Definition — Deterministic Finite Automaton (DFA):**
> A DFA is a 5-tuple (Q, Σ, δ, q₀, F) where:
> - Q is a finite set of states
> - Σ is the input alphabet (here: Unicode characters)
> - δ : Q × Σ → Q is the transition function
> - q₀ ∈ Q is the start state
> - F ⊆ Q is the set of accepting states
>
> A DFA accepts a string w if δ*(q₀, w) ∈ F, where δ* is the iterated transition function.
> In Alpaca's combined lexer DFA, each accepting state also carries a *token label* indicating
> which token class was matched.
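
The 5-tuple can be executed directly. The following toy sketch (state numbering and `delta` encoding are illustrative, not anything Alpaca generates) simulates a small DFA for the `NUMBER` language `[0-9]+(\.[0-9]+)?`, following exactly one transition per input character:

```scala
// Toy DFA for [0-9]+(\.[0-9]+)? with states:
// 0 = start, 1 = integer part (accepting), 2 = just saw '.', 3 = fraction (accepting)
val digits = '0' to '9'
val delta: Map[(Int, Char), Int] =
  (digits.map(d => (0, d) -> 1)
    ++ digits.map(d => (1, d) -> 1)
    ++ Seq((1, '.') -> 2)
    ++ digits.map(d => (2, d) -> 3)
    ++ digits.map(d => (3, d) -> 3)).toMap
val accepting = Set(1, 3) // F

def accepts(w: String): Boolean =
  var q = 0                           // q0
  for c <- w do
    delta.get((q, c)) match
      case Some(next) => q = next     // the single applicable transition
      case None       => return false // δ undefined here: reject
  accepting(q)                        // accept iff the final state is in F
```

Each character costs one map lookup, which is the O(n) single-pass behavior the text describes.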

## Combining Token Patterns into One Automaton

To lex a language with multiple token classes, the standard approach builds one combined DFA. In
theory: construct an NFA for each token pattern, connect them all to a new start state with
epsilon transitions, then convert the combined NFA to a single DFA.

Alpaca follows the same principle but implements it with Java's regex engine. Note that
`java.util.regex.Pattern` is a backtracking engine rather than a DFA executor, so the DFA
guarantees above are conceptual background here, not a literal description of the runtime:

- All token patterns are combined into a single Java regex alternation at compile time:

```
// Conceptual: how Alpaca combines patterns internally
(?<NUMBER>[0-9]+(\.[0-9]+)?)|(?<PLUS>\+)|(?<MINUS>-)|(?<TIMES>\*)|...
```

- `java.util.regex.Pattern.compile(...)` is called inside the `lexerImpl` macro at compile
time. An invalid regex pattern therefore causes a compile error, not a runtime crash.
- At runtime, `Tokenization.tokenize()` uses `matcher.lookingAt()` on the combined pattern at
the current input position. It then checks which named group matched using
`matcher.start(i)` to determine the token class.
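
The combined-alternation technique can be sketched with plain `java.util.regex` calls. This is a hypothetical illustration (the group names and `tokenAt` helper are not Alpaca's internals): `lookingAt` anchors the match at the region start, and `start(groupName)` reveals which alternative actually matched:

```scala
import java.util.regex.Pattern

// One alternation of named groups, tried as a whole at the current position.
val combined = Pattern.compile(
  "(?<NUMBER>[0-9]+(\\.[0-9]+)?)|(?<PLUS>\\+)|(?<WS>\\s+)")
val groupNames = List("NUMBER", "PLUS", "WS")

def tokenAt(input: String, pos: Int): Option[(String, String)] =
  val m = combined.matcher(input).region(pos, input.length)
  if m.lookingAt() then
    // start(name) is >= 0 only for the group that participated in the match
    groupNames.collectFirst { case g if m.start(g) >= 0 => g -> m.group(g) }
  else None
```

For example, `tokenAt("3 + 4", 2)` identifies the `PLUS` alternative as the one that matched.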

In practice, this combined-pattern approach lets Alpaca's lexer scan the input from left to
right in a single pass, much like a hand-built DFA-based lexer. However, it still relies on
Java's backtracking regex engine internally, so Alpaca does not claim a strict worst-case O(n)
time guarantee or the complete absence of backtracking for arbitrary token patterns.

## Shadowing Detection

A practical issue with ordered alternation is *shadowing*. Intuitively, pattern A shadows pattern
B if, whenever B could match starting at some input position, A also matches there, and A appears
before B in the lexer definition. In that situation B will never be the pattern the lexer
chooses — it is effectively dead code.

Formally, because Alpaca's lexer uses `matcher.lookingAt()`, matching is prefix-based rather than
whole-string based. Alpaca's `RegexChecker` therefore uses the `dregex` library (a Scala/JVM
library for decidable regex operations) to check at compile time whether the *prefix-extended*
language of a later pattern (conceptually its language with `.*` appended) is a subset of the
prefix-extended language of an earlier pattern. If such a subset relation holds, the macro throws
a `ShadowException` with a compile error pointing to the offending patterns.

**Example:** If you wrote the integer pattern `"[0-9]+"` before the decimal pattern
`"[0-9]+(\\.[0-9]+)?"`, the integer pattern would shadow the decimal one — every decimal like
`"3.14"` is also matched by `"[0-9]+"` on its initial digits, and the integer pattern can match
the prefix `"3"` and would consume it first. The prefix-based `dregex` check catches this
ordering mistake at compile time rather than silently producing wrong output at runtime.

In `CalcLexer`, the decimal pattern `"[0-9]+(\\.[0-9]+)?"` is listed first, before any simpler
integer-only pattern, so no shadowing occurs.
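
The ordering hazard is easy to reproduce with plain Java regexes. This demo is illustrative only — it is not `RegexChecker`, which rules the bad ordering out at compile time rather than observing it at runtime:

```scala
import java.util.regex.Pattern

// With the integer pattern first, ordered first-match lexes "3.14" as the
// integer "3", stranding the "." — the decimal pattern is shadowed.
val intFirst = List(
  Pattern.compile("[0-9]+"),
  Pattern.compile("[0-9]+(\\.[0-9]+)?"))

def firstMatch(patterns: List[Pattern], input: String): Option[String] =
  patterns.iterator
    .map(_.matcher(input))
    .collectFirst { case m if m.lookingAt() => m.group() }
```

With `intFirst`, `firstMatch(intFirst, "3.14")` returns only `"3"`; reversing the list recovers the full `"3.14"`.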

> **Compile-time processing:** The `lexer` macro validates all regex patterns, combines them into a single alternation pattern, and checks for shadowing using `dregex` — all at compile time. If a regex is invalid or one pattern shadows another, you get a compile error. At runtime, the generated `Tokenization` object runs the pre-compiled combined regex against your input string.

## Cross-links

- See [Lexer](../lexer.html) for the complete `lexer` DSL reference.
- See [Tokens and Lexemes](tokens.html) for what the lexer produces — the lexeme stream.
- Next: [Context-Free Grammars](cfg.md) for how token streams are parsed.
88 changes: 88 additions & 0 deletions docs/_docs/theory/pipeline.md
@@ -0,0 +1,88 @@
# The Compilation Pipeline

Source text is just a string. A compiler pipeline is a sequence of transformations that turns that string into something structured and meaningful. Each stage takes the output of the previous one, narrowing the representation from raw text to a typed result.

Understanding the pipeline gives you a mental model that applies to every Alpaca program you write — not just calculator expressions, but any language you define with the library.

## The Four Stages

Most compilers share the same four-stage structure:

1. **Source text** — the raw input string, e.g., `"3 + 4 * 2"`
2. **Lexical analysis** — groups characters into tokens: `NUMBER(3.0)`, `PLUS`, `NUMBER(4.0)`, `TIMES`, `NUMBER(2.0)`
3. **Syntactic analysis** — arranges tokens into a parse tree (concrete syntax tree) that encodes grammatical structure
4. **Semantic analysis / evaluation** — extracts meaning from the tree, producing a typed result (in a calculator: `Double`)

Some compilers add a fifth stage — code generation — that emits machine code or bytecode. Alpaca stops at stage 4: its pipeline produces a typed Scala value, not machine code.

## Alpaca's Pipeline

With Alpaca, running the full pipeline takes two calls:

```scala sc:nocompile
// Full pipeline: source text → typed result
val (_, lexemes) = CalcLexer.tokenize("3 + 4 * 2")
// lexemes: List[Lexeme] — NUMBER(3.0), PLUS, NUMBER(4.0), TIMES, NUMBER(2.0)

val (_, result) = CalcParser.parse(lexemes)
// result: Double | Null = 11.0
```

`CalcLexer.tokenize` handles stages 1–2: it takes the source string and returns a tuple whose second component is the `List[Lexeme]` (the first component is the lexer context, destructured away above). `CalcParser.parse` handles stages 3–4: it takes those lexemes, builds the parse tree internally, and returns the typed result, again as the second component of a tuple.

Both `CalcLexer` and `CalcParser` are objects generated by Alpaca's macros. Their definitions live in separate files (see the cross-links at the bottom of this page).

## Compile-time vs Runtime Boundary

Alpaca draws a sharp line between what happens at compile time and what happens at runtime. This is the most important thing to understand about the library.

> **Compile-time processing:** When you write a `lexer` definition, the Scala 3 macro validates your regex patterns, checks for shadowing, and generates the `Tokenization` object. When you write a `Parser` definition, the macro reads your grammar, builds the LR(1) parse table, and detects any shift/reduce conflicts — all at compile time. At runtime, `tokenize(input)` and `parse(lexemes)` execute the pre-generated code.

In concrete terms:

**Compile time:**
- The `lexer` macro validates regex patterns, detects shadowing (where one pattern makes another unreachable), and emits a `Tokenization` object
- The `Parser` macro reads every `Rule` declaration, constructs the LR(1) parse table, and reports any shift/reduce or reduce/reduce conflicts as compile errors

**Runtime:**
- `tokenize(input)` executes the pre-generated code and returns `List[Lexeme]`
- `parse(lexemes)` executes the pre-built parse table and returns the typed result

The consequence: if your regex is invalid, or your grammar is ambiguous, you get a compile error — not a runtime crash. The pipeline is safe by construction before it ever runs on real input.

Alpaca covers stages 1–4 of the classical pipeline. The "code generation" stage is not part of the library — your Scala semantic actions in the parser rules produce the final typed value directly.

## Formal Definition

> **Definition — Compilation pipeline:**
> A compiler pipeline is a composition of transformations f₁ ∘ f₂ ∘ ... ∘ fₙ where each fᵢ maps the output of fᵢ₋₁ to a more structured representation.
> Alpaca's pipeline (projecting away the threaded context and the `Null` failure case) can be viewed as `parse ∘ tokenize : String → R`, where `R` is the root non-terminal's result type.
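
The composition view can be made concrete with stub stages. This toy (the `tokenize`/`parse` stubs and their "sum the numeric tokens" semantics are invented for illustration, not Alpaca's API) shows the shape `parse ∘ tokenize`:

```scala
// Toy pipeline-as-composition: each stage maps the previous representation
// to a more structured one. Stub semantics: sum the numeric tokens.
val tokenize: String => List[String] =
  _.split("\\s+").toList.filter(_.nonEmpty)                      // String → token stream
val parse: List[String] => Double =
  _.collect { case s if s.forall(_.isDigit) => s.toDouble }.sum  // stream → typed result

val pipeline: String => Double = tokenize.andThen(parse)
```

Here `pipeline("3 + 4")` produces `7.0` under the toy semantics; only the stage implementations, not the shape, change in a real Alpaca pipeline.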

For the calculator example, `R` is `Double`. For a JSON parser, `R` might be `Any` or a custom AST type. The pipeline shape is always the same; only the result type changes.

The parser internally appends a special `Lexeme.EOF` marker to the lexeme list before running the shift/reduce loop. This is an implementation detail — you do not need to add it yourself.

## Mapping the Stages to Alpaca Types

Each pipeline stage corresponds to a concrete Alpaca type:

| Stage | Input | Output | Alpaca Type |
|-------|-------|--------|-------------|
| Source text | — | `String` | `String` (plain Scala) |
| Lexical analysis | `String` | token stream | `List[Lexeme]` |
| Syntactic analysis | `List[Lexeme]` | parse tree (internal) | LR(1) stack (internal) |
| Semantic analysis | parse tree | typed result | `R \| Null` (your root type) |

The parse tree is never exposed directly — Alpaca builds it internally and immediately evaluates your semantic actions (the `=>` expressions in `rule` definitions). What you get back from `parse` is the final typed value, not an intermediate tree.

## What Comes Next

The rest of the Compiler Theory Tutorial builds on this mental model:

- Next: [Tokens & Lexemes](tokens.html) — what the lexer produces: token classes, token instances, and how they are represented in Alpaca
- [The Lexer: Regex to Finite Automata](lexer-fa.html) — how regular expressions define token classes and how Alpaca compiles them

For the full API, see the reference pages:

- See [Lexer](../lexer.html) for how `CalcLexer` is defined.
- See [Parser](../parser.html) for how `CalcParser` is defined and how grammar rules produce a typed result.
123 changes: 123 additions & 0 deletions docs/_docs/theory/tokens.md
@@ -0,0 +1,123 @@
# Tokens and Lexemes

A lexer transforms raw source text into a sequence of structured tokens — the first stage of
compilation. Before writing a lexer, it helps to understand the formal vocabulary: what a token
class is, how individual matches relate to it, and what a lexeme carries.

## Terminal Symbols

In formal grammar, a *terminal symbol* is an atomic element that cannot be broken down further.
It is the end of the line for derivation — terminals represent the actual characters or strings
that appear in source text. In a lexer, each token class acts as a terminal: it names a category
of strings, and no lexer-level expansion applies below it.

See the discussion of context-free grammars for how terminals fit into production rules.

## Token Classes vs Token Instances

It is useful to distinguish three levels:

**Token class** — defines a category of strings by a regular expression. For example, the
`NUMBER` class matches any string of the form `[0-9]+(\.[0-9]+)?`: the integers `"3"`, `"42"`,
and the decimals `"3.14"`, `"0.5"`.

**Token instance** — a specific string found in the input that belongs to a token class. When
the lexer scans `"3 + 4"`, it finds three token instances: the string `"3"` (a NUMBER), the
string `"+"` (a PLUS), and the string `"4"` (another NUMBER).

**Lexeme** — the full record of a token instance: the token class, the matched text, and its
position in the source. Tokenizing `"3 + 4"` produces three lexemes — a `NUMBER` with text
`"3"`, a `PLUS` with text `"+"`, and a `NUMBER` with text `"4"` — each also carrying its
position (in Alpaca, the 1-based column within the current line at the end of the match) and
its line number.
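
A simplified model of this record can be written as a plain case class. This is not Alpaca's generated `Lexeme` type (which carries precise `Name`/`Value` types), and the positions shown are illustrative end-of-match columns:

```scala
// Simplified model of a lexeme record: token class, matched text, and location.
case class SimpleLexeme(name: String, text: String, position: Int, line: Int)

val lexemes = List(
  SimpleLexeme("NUMBER", "3", position = 1, line = 1),
  SimpleLexeme("PLUS",   "+", position = 3, line = 1),
  SimpleLexeme("NUMBER", "4", position = 5, line = 1)
)
```

The parser only needs the `name`/`value` pair to drive its decisions; the location fields exist for error reporting.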

The word *lexeme* is used throughout this documentation to mean this complete record.

> **Definition — Lexeme:**
> A *lexeme* is a triple (T, w, pos) where T is a token class, w ∈ L(T) is the matched string
> (a member of the language defined by T's regex), and pos is the position of the end of the
> match in the source text.
> In Alpaca: `Lexeme[Name, Value]` where `Name` is the token class name (a string literal type)
> and `Value` is the Scala type of the extracted value.

## Alpaca's Lexeme Type

In Alpaca, each matched token is represented as a `Lexeme[Name, Value]`. A lexeme carries five
core pieces of information:

- `name` — the token class name string, e.g., `"NUMBER"` or `"PLUS"`
- `value` — the extracted value with its Scala type, e.g., `3.14: Double` for NUMBER, `(): Unit`
  for PLUS
- `text` — the matched source substring (available as `lexeme.text`)
- `position` — the 1-based column within the current line at the end of the match
- `line` — the line number at the end of the match

The tokenization output for a simple expression illustrates this:

```scala sc:nocompile
val (_, lexemes) = CalcLexer.tokenize("3 + 4 * 2")
// lexemes: List[Lexeme] =
// NUMBER(3.0), PLUS, NUMBER(4.0), TIMES, NUMBER(2.0)
//
// Each Lexeme carries:
// .name — token class name (e.g., "NUMBER")
// .value — extracted value (e.g., 3.0: Double)
// .position — 1-based column at end of match (within the current line)
// .line — line number at end of match
```

Whitespace matches `Token.Ignored` and does not produce a lexeme — it disappears from the stream.

## CalcLexer Token Class Table

The `CalcLexer` running example defines seven token classes:

| Token Class | Regex Pattern | Value Type | Example Match |
|-------------|--------------|------------|---------------|
| `NUMBER` | `[0-9]+(\.[0-9]+)?` | `Double` | `"3.14"` → `3.14` |
| `PLUS` | `\+` | `Unit` | `"+"` |
| `MINUS` | `-` | `Unit` | `"-"` |
| `TIMES` | `\*` | `Unit` | `"*"` |
| `DIVIDE` | `/` | `Unit` | `"/"` |
| `LPAREN` | `\(` | `Unit` | `"("` |
| `RPAREN` | `\)` | `Unit` | `")"` |

Whitespace is ignored (`Token.Ignored`) and does not appear in the lexeme stream.

`NUMBER` is the only value-bearing token: the macro uses the `@` binding to convert the matched
string to a `Double`. The remaining six tokens carry `Unit` — their presence in the stream is
enough; no value needs to be extracted.

## Full CalcLexer Definition

The canonical CalcLexer definition, which appears throughout the documentation as a running
example:

```scala sc:nocompile
import alpaca.*

val CalcLexer = lexer:
case num @ "[0-9]+(\\.[0-9]+)?" => Token["NUMBER"](num.toDouble)
case "\\+" => Token["PLUS"]
case "-" => Token["MINUS"]
case "\\*" => Token["TIMES"]
case "/" => Token["DIVIDE"]
case "\\(" => Token["LPAREN"]
case "\\)" => Token["RPAREN"]
case "\\s+" => Token.Ignored
```

Each `case` arm maps a Java regex pattern to a token constructor. Patterns are tested in order;
the first match wins. The `num @` binding in the first arm captures the matched text as a
`String`, which `num.toDouble` converts to a `Double` before it is stored in the lexeme.

This definition uses `sc:nocompile` because `lexer` is a Scala 3 macro: the macro runs at
compile time, validates all regex patterns, checks for shadowing, and generates the
`Tokenization` object. See [The Lexer: Regex to Finite Automata](lexer-fa.html) for what the
macro does internally.

## Cross-links

- See [Lexer](../lexer.html) for the full `lexer` DSL reference and all token forms.
- See [The Lexer: Regex to Finite Automata](lexer-fa.html) for how regex patterns define token
classes formally.