109 changes: 109 additions & 0 deletions docs/_docs/theory/cfg.md
@@ -0,0 +1,109 @@
Context-free grammars are the backbone of syntactic analysis. A grammar defines a language by specifying how symbols can be combined and rewritten — "context-free" means each rule applies regardless of surrounding context. If the lexer is the vocabulary of a language, the grammar is its syntax.

## What is a Context-Free Grammar?

A grammar consists of a set of non-terminal symbols (grammar variables that can be expanded), a set of terminal symbols (the tokens the lexer produces), a set of production rules (rewrite rules), and a start symbol. A derivation starts from the start symbol and repeatedly replaces non-terminals with production right-hand sides until only terminals remain. The language of a grammar G is the set of all terminal strings reachable from the start symbol.

> **Definition — Context-Free Grammar:**
> A CFG is a 4-tuple G = (V, Σ, R, S) where:
> - V is a finite set of non-terminal symbols (grammar variables)
> - Σ is a finite set of terminal symbols (tokens), V ∩ Σ = ∅
> - R ⊆ V × (V ∪ Σ)* is a finite set of production rules
> - S ∈ V is the start symbol
>
> A production rule A → α means the non-terminal A can be replaced by the symbol string α.
> A grammar generates the language L(G) = { w ∈ Σ* | S ⇒* w } — all terminal strings
> derivable from S in zero or more steps.

## BNF Notation

Production rules are written in Backus-Naur Form (BNF): `A → α` means A can be rewritten as α. The vertical bar `|` separates alternatives, so `A → α | β` is shorthand for the two rules `A → α` and `A → β`. Non-terminals are written in CamelCase, except for the lowercase start symbol `root`; terminals are UPPERCASE (matching Alpaca's token name conventions).

EBNF (Extended BNF) adds optional elements `[...]`, repetition `{...}`, and grouping `(...)`. These shorthands can always be translated into plain BNF, but are useful for compact notation. This page uses BNF throughout for clarity; Alpaca's DSL maps directly to BNF productions.

## The Calculator Grammar

The calculator grammar is the running example for the entire Compiler Theory Tutorial. It defines arithmetic expressions with four operators and parentheses:

```
Expr → Expr PLUS Expr
     | Expr MINUS Expr
     | Expr TIMES Expr
     | Expr DIVIDE Expr
     | LPAREN Expr RPAREN
     | NUMBER

root → Expr
```

Identifying the 4-tuple components:

- V = {Expr, root} — two non-terminals
- Σ = {NUMBER, PLUS, MINUS, TIMES, DIVIDE, LPAREN, RPAREN} — seven terminal symbols, produced by CalcLexer
- R = the 7 production rules above
- S = root — the start symbol
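
The 4-tuple above can also be written down as plain data, which makes the side conditions of the definition checkable. The following is a minimal sketch in standalone Scala 3; `CfgRule`, `Cfg`, and `isWellFormed` are hypothetical names chosen for illustration and are not part of Alpaca.

```scala
// Hypothetical encoding of G = (V, Σ, R, S); none of these names come from Alpaca.
final case class CfgRule(lhs: String, rhs: List[String])

final case class Cfg(
  nonTerminals: Set[String], // V
  terminals: Set[String],    // Σ
  rules: List[CfgRule],      // R
  start: String              // S
):
  // Checks the side conditions of the definition: V ∩ Σ = ∅, S ∈ V,
  // and every rule rewrites a non-terminal into known symbols.
  def isWellFormed: Boolean =
    nonTerminals.intersect(terminals).isEmpty &&
      nonTerminals.contains(start) &&
      rules.forall { r =>
        nonTerminals.contains(r.lhs) &&
          r.rhs.forall(sym => nonTerminals.contains(sym) || terminals.contains(sym))
      }

val calcGrammar: Cfg =
  val binaryOps = List("PLUS", "MINUS", "TIMES", "DIVIDE")
  Cfg(
    nonTerminals = Set("Expr", "root"),
    terminals = Set("NUMBER", "LPAREN", "RPAREN") ++ binaryOps,
    rules =
      binaryOps.map(op => CfgRule("Expr", List("Expr", op, "Expr"))) ++
        List(
          CfgRule("Expr", List("LPAREN", "Expr", "RPAREN")),
          CfgRule("Expr", List("NUMBER")),
          CfgRule("root", List("Expr"))
        ),
    start = "root"
  )
```

Here `calcGrammar.rules.size` is 7 and `calcGrammar.terminals.size` is 7, matching the counts listed above.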

Note: this grammar is **ambiguous**. The expression `1 + 2 * 3` has two distinct parse trees, one grouping it as `(1 + 2) * 3` and one as `1 + (2 * 3)`, because nothing in the grammar ranks the operators. The [Conflict Resolution](../conflict-resolution.md) page shows how Alpaca resolves such ambiguities.
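
To see the ambiguity concretely, the sketch below enumerates every tree that the rule `Expr → Expr op Expr` permits for a flat operand/operator sequence and evaluates each one. It is standalone Scala 3 for illustration only: `ExprTree`, `allParses`, and `evalTree` are hypothetical names, parentheses are left out for brevity, and none of this reflects how Alpaca parses.

```scala
// Hypothetical mini-AST for illustration; Alpaca itself never exposes a tree.
enum ExprTree:
  case Num(value: Double)
  case Bin(op: Char, left: ExprTree, right: ExprTree)

// Enumerates every parse tree that Expr → Expr op Expr allows for the input
// operands(0) ops(0) operands(1) ops(1) ... by trying each operator as the root.
def allParses(operands: Vector[Double], ops: Vector[Char]): List[ExprTree] =
  if operands.size == 1 then List(ExprTree.Num(operands.head))
  else
    for
      i <- ops.indices.toList
      l <- allParses(operands.take(i + 1), ops.take(i))
      r <- allParses(operands.drop(i + 1), ops.drop(i + 1))
    yield ExprTree.Bin(ops(i), l, r)

def evalTree(t: ExprTree): Double = t match
  case ExprTree.Num(v) => v
  case ExprTree.Bin('+', l, r) => evalTree(l) + evalTree(r)
  case ExprTree.Bin('-', l, r) => evalTree(l) - evalTree(r)
  case ExprTree.Bin('*', l, r) => evalTree(l) * evalTree(r)
  case ExprTree.Bin('/', l, r) => evalTree(l) / evalTree(r)
  case ExprTree.Bin(op, _, _) => sys.error(s"unknown operator $op")

// 1 + 2 * 3
val parsesOfExample = allParses(Vector(1.0, 2.0, 3.0), Vector('+', '*'))
val valuesOfExample = parsesOfExample.map(evalTree)
```

`parsesOfExample` contains two trees, and `valuesOfExample` contains 7.0 for `1 + (2 * 3)` and 9.0 for `(1 + 2) * 3`. The grammar alone does not say which result is intended.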

## Derivation

A *derivation* is a sequence of rewriting steps from the start symbol to a terminal string. In a *leftmost derivation*, each step replaces the leftmost non-terminal with one of its production alternatives.

Leftmost derivation for `1 + 2`:

```
root ⇒ Expr
     ⇒ Expr PLUS Expr        (apply: Expr → Expr PLUS Expr)
     ⇒ NUMBER PLUS Expr      (apply: Expr → NUMBER, leftmost)
     ⇒ NUMBER PLUS NUMBER    (apply: Expr → NUMBER, leftmost)
```

The first step applies `root → Expr`; the second expands the leftmost `Expr` using the `Expr PLUS Expr` production; the third and fourth substitute the literal `NUMBER` terminal for each remaining `Expr`.
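
The same derivation can be replayed mechanically. This is a small sketch in standalone Scala 3, assuming symbols are plain strings; `leftmostStep` and the other names are hypothetical helpers, not Alpaca API.

```scala
// Hypothetical helper: symbols are plain strings, and the only non-terminals are "root" and "Expr".
val derivNonTerminals = Set("root", "Expr")

// One leftmost step: replace the first non-terminal in the sentential form with `rhs`.
def leftmostStep(form: List[String], lhs: String, rhs: List[String]): List[String] =
  val i = form.indexWhere(derivNonTerminals.contains)
  require(i >= 0 && form(i) == lhs, s"leftmost non-terminal is not $lhs")
  form.take(i) ++ rhs ++ form.drop(i + 1)

// The four rule applications from the derivation of `1 + 2`.
val derivationSteps: List[(String, List[String])] = List(
  "root" -> List("Expr"),
  "Expr" -> List("Expr", "PLUS", "Expr"),
  "Expr" -> List("NUMBER"),
  "Expr" -> List("NUMBER")
)

// scanLeft keeps every intermediate sentential form, starting from the start symbol.
val sententialForms: List[List[String]] =
  derivationSteps.scanLeft(List("root")) { case (form, (lhs, rhs)) =>
    leftmostStep(form, lhs, rhs)
  }
```

`sententialForms` holds five forms, from `List("root")` to `List("NUMBER", "PLUS", "NUMBER")`, one per line of the derivation plus the start symbol.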

## Parse Trees

A parse tree captures the grammatical structure of a derivation as a tree. Each internal node is a non-terminal; each leaf is a terminal. The parse tree for `1 + 2`:

```
         root
          |
         Expr
       /  |   \
   Expr  PLUS  Expr
    |            |
 NUMBER        NUMBER
  (1.0)         (2.0)
```

Note: In Alpaca, the parse tree is never exposed to user code. The `Parser` macro builds it internally during the shift-reduce parse, and immediately evaluates your semantic actions (the `=>` expressions in `rule` definitions) as each node is reduced. What `parse()` returns is the typed result — a `Double` in the calculator case — not an intermediate tree object. (See [The Compilation Pipeline](pipeline.md) for the full picture.)
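
One way to picture "evaluate on reduce" is a stack that stores results instead of tree nodes. The sketch below is standalone Scala 3 and only a mental model: `StackEntry`, `reduceNumber`, and `reducePlus` are hypothetical names, and Alpaca's internal representation may look quite different.

```scala
// Hypothetical model of a value stack; not Alpaca's internal representation.
enum StackEntry:
  case Tok(name: String)     // a shifted terminal that carries no value
  case Value(result: Double) // the result of a semantic action

import StackEntry.*

// Reduce Expr → NUMBER: the action's result is the number itself.
def reduceNumber(stack: List[StackEntry], n: Double): List[StackEntry] =
  Value(n) :: stack

// Reduce Expr → Expr PLUS Expr: pop three entries, run the action `a + b`, push the result.
def reducePlus(stack: List[StackEntry]): List[StackEntry] = stack match
  case Value(b) :: Tok("PLUS") :: Value(a) :: rest => Value(a + b) :: rest
  case _ => sys.error("stack does not end in Expr PLUS Expr")

// Parsing `1 + 2`; the head of the list is the top of the stack.
val afterOne = reduceNumber(Nil, 1.0)
val afterPlus = Tok("PLUS") :: afterOne
val afterTwo = reduceNumber(afterPlus, 2.0)
val finalStack = reducePlus(afterTwo)
```

`finalStack` is `List(Value(3.0))`: at no point does the stack hold an `Expr` node, only the `Double` each reduction produced.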

## Alpaca DSL Mapping

The calculator grammar maps directly to an Alpaca `Parser` definition. Each production rule becomes a case clause in a `rule(...)` call: the pattern to the left of `=>` matches the production's right-hand side, and the expression to the right of `=>` computes the result.

```scala sc:nocompile
import alpaca.*

object CalcParser extends Parser:
  val Expr: Rule[Double] = rule(
    { case (Expr(a), CalcLexer.PLUS(_), Expr(b)) => a + b },
    { case (Expr(a), CalcLexer.MINUS(_), Expr(b)) => a - b },
    { case (Expr(a), CalcLexer.TIMES(_), Expr(b)) => a * b },
    { case (Expr(a), CalcLexer.DIVIDE(_), Expr(b)) => a / b },
    { case (CalcLexer.LPAREN(_), Expr(e), CalcLexer.RPAREN(_)) => e },
    { case CalcLexer.NUMBER(n) => n.value },
  )
  val root: Rule[Double] = rule:
    case Expr(v) => v
```

Each `case` clause corresponds to one production rule. `Expr(a)` matches a reduced `Expr` non-terminal with value `a`. `CalcLexer.PLUS(_)` matches the PLUS terminal (the `_` discards the lexeme value since PLUS carries `Unit`). `CalcLexer.NUMBER(n)` matches a NUMBER terminal; `n.value` accesses the `Double` extracted by the lexer. The grammar's non-terminals (`Expr`, `root`) become `Rule[Double]` values; the type parameter is the result type of each reduction.

> **Compile-time processing:** When you define `object CalcParser extends Parser`, the Alpaca macro reads every `rule` declaration and constructs the LR(1) parse table at compile time.

## Cross-links

- See [Tokens and Lexemes](tokens.md) for how the terminal symbols (NUMBER, PLUS, etc.) are produced by the lexer.
- Next: [Why LR?](why-lr.md) — why LR parsing was chosen over top-down alternatives.
- See [Parser](../parser.md) for the complete `rule` DSL reference and all extractor forms.
- See [Conflict Resolution](../conflict-resolution.md) for how Alpaca resolves ambiguity in the calculator grammar.
98 changes: 98 additions & 0 deletions docs/_docs/theory/conflicts.md
@@ -0,0 +1,98 @@
A grammar is ambiguous if a string can be parsed in more than one way. In LR parsing, ambiguity manifests as a conflict: the parse table has two valid entries for the same (state, symbol) pair, and the parser cannot proceed deterministically.

## What is a Parse Table Conflict?

The LR(1) parse table maps (state, lookahead terminal) pairs to actions — either Shift (push the next token) or Reduce (pop a production's right-hand side and produce a non-terminal). A conflict exists when a single (state, terminal) pair has more than one valid action: the parse table has a collision.

> **Definition — Parse Table Conflict:**
> A conflict in parse state s exists when the parse table has more than one entry
> for the pair (s, t) for some lookahead terminal t ∈ Σ ∪ {$}.
> A shift/reduce conflict has one entry Shift(s') and one entry Reduce(A → α).
> A reduce/reduce conflict has two entries Reduce(A → α) and Reduce(B → β).
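
The definition translates directly into a check over a table whose cells may hold several actions. The following standalone Scala 3 sketch is a simplified model: `LrAction`, `ConflictKind`, `findConflicts`, and the state numbers are all hypothetical, and Alpaca's real table is built inside the macro.

```scala
// Hypothetical table model; Alpaca's real table is built inside the macro and looks different.
enum LrAction:
  case Shift(target: Int)
  case Reduce(lhs: String, rhs: List[String])

enum ConflictKind:
  case ShiftReduce, ReduceReduce

type TableCell = (Int, String) // (state, lookahead terminal)

// Classifies every cell that holds more than one action.
def findConflicts(table: Map[TableCell, Set[LrAction]]): Map[TableCell, ConflictKind] =
  table.collect {
    case (cell, actions) if actions.size > 1 =>
      val hasShift = actions.exists {
        case LrAction.Shift(_) => true
        case _ => false
      }
      cell -> (if hasShift then ConflictKind.ShiftReduce else ConflictKind.ReduceReduce)
  }

// State numbers 7 and 4 are made up for the example.
val plusRule = LrAction.Reduce("Expr", List("Expr", "PLUS", "Expr"))
val sampleTable: Map[TableCell, Set[LrAction]] = Map(
  (7, "PLUS") -> Set(LrAction.Shift(4), plusRule), // two actions: conflict
  (7, "$") -> Set(plusRule)                        // one action: fine
)
val sampleConflicts = findConflicts(sampleTable)
```

`sampleConflicts` contains a single entry, `(7, "PLUS") -> ConflictKind.ShiftReduce`; the `$` cell is deterministic and is not reported.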

## Shift/Reduce Conflicts

At some parse state, given lookahead token t, the parser could either shift t (push it and move to a new state) or reduce by some production A → α (pop the right-hand side and produce A). Both are valid actions for the same (state, t) pair — the parser cannot decide between them deterministically.

Why it happens: two or more LR(1) items in the same state propose incompatible actions for the same lookahead. The grammar allows the same prefix to continue in two different ways, and the LR automaton sees both paths simultaneously.

**Example: `1 + 2 + 3` in the calculator grammar.** After parsing `Expr PLUS Expr` with lookahead `PLUS`, the parser has two valid choices:

- **Reduce** `Expr → Expr PLUS Expr`: complete the first addition and produce a single `Expr`, treating the input as `(1 + 2) + 3`.
- **Shift** the second `PLUS`: keep accumulating, treating the input as `1 + (2 + 3)`.

Both are valid parse trees for `1 + 2 + 3` — the grammar (from [cfg.md](cfg.md)) is ambiguous for binary operator chains. Alpaca detects this conflict at compile time and reports:

```
Shift "PLUS ($plus)" vs Reduce Expr -> Expr PLUS ($plus) Expr
In situation like:
Expr PLUS ($plus) Expr PLUS ($plus) ...
Consider marking production Expr -> Expr PLUS ($plus) Expr to be alwaysBefore or alwaysAfter "PLUS ($plus)"
```

> **Note:** The error message says `alwaysBefore`/`alwaysAfter`. These method names do not exist in the Alpaca API. The correct methods are `before` and `after`. See [Conflict Resolution](../conflict-resolution.md) for full details on reading error messages.
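
The effect of each choice can be made visible with a toy model of the decision point. The sketch below is standalone Scala 3 and deliberately simplistic: `groupAdditions` is a hypothetical function that only handles chains of additions and builds a bracketed string, with a flag standing in for "reduce wins" versus "shift wins" when the lookahead is `PLUS`.

```scala
// Hypothetical toy parser for NUMBER (PLUS NUMBER)* that builds a bracketed string.
// `reduceFirst` picks the winner when the stack holds Expr PLUS Expr and the lookahead is PLUS.
def groupAdditions(numbers: List[String], reduceFirst: Boolean): String =
  require(numbers.nonEmpty, "need at least one operand")

  // The stack holds operands only; a PLUS between neighbours is implied.
  def reduceTop(stack: List[String]): List[String] = stack match
    case b :: a :: rest => s"($a + $b)" :: rest
    case _ => stack

  val afterInput = numbers.foldLeft(List.empty[String]) { (stack, n) =>
    // Before shifting the next PLUS and operand, optionally reduce the pending addition.
    val ready = if reduceFirst && stack.size >= 2 then reduceTop(stack) else stack
    n :: ready
  }
  // At end of input only reductions remain.
  Iterator.iterate(afterInput)(reduceTop).dropWhile(_.size > 1).next().head
```

With the operands `1`, `2`, `3`, `reduceFirst = true` yields `((1 + 2) + 3)` and `reduceFirst = false` yields `(1 + (2 + 3))`, which are the two parse trees the conflict report is warning about.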

## Reduce/Reduce Conflicts

A reduce/reduce conflict occurs when two different productions can reduce the same token sequence with the same lookahead. The parser has two Reduce entries for the same (state, t) pair and cannot decide which to apply.

**Example:** if a grammar has both `Integer → NUMBER` and `Float → NUMBER`, and the parser has `NUMBER` on the stack with lookahead `$`, it cannot determine which reduction to apply — both are valid. Alpaca reports:

```
Reduce Integer -> Number vs Reduce Float -> Number
In situation like:
Number ...
Consider marking one of the productions to be alwaysBefore or alwaysAfter the other
```

> **Note:** The error message says `alwaysBefore`/`alwaysAfter`. These method names do not exist in the Alpaca API. The correct methods are `before` and `after`. See [Conflict Resolution](../conflict-resolution.md) for full details on reading error messages.

Reduce/reduce conflicts are less common than shift/reduce conflicts. They typically indicate a grammar design issue — two rules competing for the same token sequence. The usual fix is to restructure the grammar so the two competing productions have distinct right-hand sides, or to use a different non-terminal.

## How LR(1) Lookahead Helps

LR(1) lookahead often resolves conflicts that weaker LR variants (LR(0), SLR) report. LR(0) reduces regardless of the next token, and SLR reduces on any terminal in the FOLLOW set of the production's left-hand side. Each LR(1) item instead carries its own lookahead terminal, so the parser reduces only when the actual next token matches that item's lookahead. This eliminates many spurious conflicts.

But for inherently ambiguous grammars — like the calculator's binary operator productions — LR(1) lookahead alone is not enough. The grammar has the same prefix structure regardless of which associativity is intended, so both shift and reduce appear valid to the automaton. Explicit resolution is required.

For a detailed explanation of items and lookahead, see [Shift-Reduce Parsing](shift-reduce.md).

## Resolution by Priority

Resolving a conflict means declaring which action wins. For a shift/reduce conflict: should the reduction or the shift take priority?

Alpaca's `before`/`after` DSL lets you declare priorities directly in the parser definition:

- `production.name.before(tokens*)` — when the conflict is between reducing `name` and shifting one of those tokens, the reduction wins. Use this for left-associativity and higher-precedence reductions.
- `production.name.after(tokens*)` — prefer shifting those tokens over reducing this production. Use this when another operator should bind more tightly.

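Priorities are transitive: Alpaca follows the declared relationships with a breadth-first search (BFS) over the priority graph, so one action can win over another through a chain of declarations rather than a single direct one. For example, declaring that reducing `times` beats shifting `PLUS` and that reducing `plus` beats shifting `MINUS` adds two edges to that graph, and precedence propagates along whatever paths those edges form.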
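
A rough picture of transitive priorities is reachability in a directed graph. The sketch below is standalone Scala 3 under simplifying assumptions: nodes are plain strings, an edge `a -> b` means "a wins over b", and `winsOver` and `hasCycle` are hypothetical helpers. How Alpaca actually represents productions and tokens in its graph is not shown here.

```scala
import scala.annotation.tailrec
import scala.collection.immutable.Queue

// Hypothetical priority graph: an edge a -> b means "a wins over b".
// Returns true if `to` is reachable from `from` in one or more steps (breadth-first).
def winsOver(edges: Map[String, Set[String]], from: String, to: String): Boolean =
  @tailrec
  def bfs(frontier: Queue[String], seen: Set[String]): Boolean =
    frontier.dequeueOption match
      case None => false
      case Some((node, rest)) =>
        val neighbours = edges.getOrElse(node, Set.empty[String])
        if neighbours.contains(to) then true
        else
          val fresh = neighbours -- seen
          bfs(rest.enqueueAll(fresh), seen ++ fresh)
  bfs(Queue(from), Set(from))

// A declaration set is inconsistent if some node transitively wins over itself.
def hasCycle(edges: Map[String, Set[String]]): Boolean =
  edges.keys.exists(n => winsOver(edges, n, n))

// Illustrative node names only.
val declaredPriorities = Map("times" -> Set("plus"), "plus" -> Set("minus"))
val cyclicPriorities = declaredPriorities + ("minus" -> Set("times"))
```

With these declarations, `winsOver(declaredPriorities, "times", "minus")` is true even though no edge links the two directly, and `hasCycle(cyclicPriorities)` is true because the added edge closes a loop.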

A minimal example — declaring left-associativity and precedence for the `plus` production only:

```scala sc:nocompile
import alpaca.*

// Goes inside the parser object; assumes the addition production is named `plus`.
override val resolutions = Set(
  production.plus.before(CalcLexer.PLUS, CalcLexer.MINUS), // left-associative: reduce + before shifting + or -
  production.plus.after(CalcLexer.TIMES, CalcLexer.DIVIDE), // lower precedence: shift * or / before reducing +
)
```

The complete CalcParser resolution set — including `minus`, `times`, and `div` — is shown on [Full Calculator Example](full-example.md). For the full DSL reference (Production(symbols*) selector, token-side resolution, cycle detection, ordering constraint), see [Conflict Resolution](../conflict-resolution.md).

## Compile-Time Detection

Conflicts are detected at compile time when the LR(1) parse table is constructed by the `extends Parser` macro. A conflict causes a compile error (`ShiftReduceConflict` or `ReduceReduceConflict`) — no conflict checking happens at runtime.

When you add `override val resolutions = Set(...)`, the macro incorporates your priority declarations into the table construction and re-checks for consistency. A cycle in your declarations (`InconsistentConflictResolution`) is also reported at compile time.

> **Compile-time processing:** Alpaca builds the LR(1) parse table when you define `object MyParser extends Parser`. Any conflict, whether shift/reduce or reduce/reduce, is reported as a compile error immediately, before your code runs.

## Cross-links

- [Context-Free Grammars](cfg.md) — the calculator grammar that produces these conflicts
- [Shift-Reduce Parsing](shift-reduce.md) — the parse table mechanics behind conflicts
- [Conflict Resolution](../conflict-resolution.md) — the full DSL reference: Production(symbols*) selector, named productions, token-side resolution, cycle detection, ordering constraint
- [Semantic Actions](semantic-actions.md) — what happens when a conflict-free reduction fires
- [Full Calculator Example](full-example.md) — the full CalcParser with conflict resolution applied