File: docs/_docs/guides/conflict-resolution.md

# Guide: Conflict Resolution

LR parsers, like the one generated by Alpaca, can encounter **conflicts** when the grammar is ambiguous. A conflict occurs when the parser has multiple valid actions for a given state and lookahead token.

Alpaca provides a declarative DSL to resolve these conflicts at compile-time, ensuring your parser is deterministic and behaves as expected.

## 1. Types of Conflicts

### Shift/Reduce Conflict
Occurs when the parser can either **shift** (move the current token to the stack) or **reduce** (apply a grammar rule to symbols already on the stack).

**Example (Dangling Else):**
```scala
if (cond) if (cond) stmt else stmt
```
Should the `else` belong to the first `if` or the second?

### Reduce/Reduce Conflict
Occurs when the parser has two or more different rules that could be applied to the same symbols on the stack. This usually indicates a more serious ambiguity in the grammar.
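
For instance, a grammar sketch along these lines (the rule names and the `VarRef`/`TypeRef` result types are illustrative, not from a real Alpaca project) has two productions that both reduce a lone `ID` on the stack:

```scala
// Both rules match a single identifier, so after shifting an ID the
// parser cannot decide which reduction to apply — a reduce/reduce conflict.
val Variable = rule:
  case MyLexer.ID(id) => VarRef(id.value)

val TypeName = rule:
  case MyLexer.ID(id) => TypeRef(id.value)
```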

## 2. Resolving Conflicts in Alpaca

Conflicts are resolved by overriding the `resolutions` member in your `Parser` object. This member is a `Set` of `ConflictResolution` rules.

```scala
object MyParser extends Parser:
  // ... rules ...

  override val resolutions = Set(
    // Resolve conflicts here
  )
```

### The `before` and `after` Operators

Alpaca uses `before` and `after` to define precedence:

- **`A after B`**: `A` has **higher precedence** than `B`. `A` will be reduced **later** than `B` (or `B` will be shifted while `A` is waiting).
- **`A before B`**: `A` has **lower precedence** than `B`. `A` will be reduced **earlier** than `B`.

You can use these operators with both **Tokens** (Terminals) and **Productions** (Rules).

#### Operator Precedence Example
```scala
override val resolutions = Set(
  MyLexer.STAR after MyLexer.PLUS,
  MyLexer.SLASH after MyLexer.PLUS
)
```
This tells the parser that `*` and `/` have higher precedence than `+`.

#### Associativity Example
To define **left-associativity** (e.g., `1 + 2 + 3` parses as `(1 + 2) + 3`), place the production `before` its own operator token, so the parser prefers reducing the existing expression over shifting another operator.

```scala
val Expr: Rule[Int] = rule(
  "add" { case (Expr(l), MyLexer.PLUS(_), Expr(r)) => l + r }
)

override val resolutions = Set(
  production.add before MyLexer.PLUS
)
```

## 3. Named Productions

For fine-grained control, you can assign names to specific productions within a rule. This allows you to reference that exact production in your `resolutions`.

```scala
val Expr: Rule[Int] = rule(
  "mul" { case (Expr(l), MyLexer.STAR(_), Expr(r)) => l * r },
  "div" { case (Expr(l), MyLexer.SLASH(_), Expr(r)) => l / r },
  { case MyLexer.NUM(n) => n.value }
)

override val resolutions = Set(
  production.mul after MyLexer.PLUS,
  production.div after MyLexer.PLUS
)
```

## 4. Understanding Conflict Errors

When Alpaca detects an unresolved conflict, it produces a detailed compile-time error message.

**Example Error:**
```text
Shift "+ ($plus)" vs Reduce Expr -> Expr + ($plus) Expr
In situation like:
Expr + ($plus) Expr + ($plus) ...
Consider marking production Expr -> Expr + ($plus) Expr to be alwaysBefore or alwaysAfter "+ ($plus)"
```

### How to read it:
1. **The Conflict**: It tells you exactly which actions are clashing (Shift `+` vs Reduce `Expr`).
2. **The Context**: "In situation like..." shows you a sample input sequence that triggers the ambiguity.
3. **The Suggestion**: It suggests which production and token need a resolution rule.
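
For the error above, the fix is a single resolution. Making `+` left-associative means preferring the reduce, which — using the named-production style from section 3 and assuming the production is named `add` — would look like:

```scala
override val resolutions = Set(
  // Reduce Expr -> Expr + Expr before shifting another `+`,
  // so `1 + 2 + 3` parses as `(1 + 2) + 3`
  production.add before MyLexer.PLUS
)
```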

## 5. Debugging with Graphs

If enabled in [Debug Settings](../debug-settings.html), Alpaca can generate internal representations of the parse table and conflict resolutions to help you visualize complex grammars.

## Best Practices

1. **Keep it Minimal**: Only add resolutions for actual conflicts reported by the compiler.
2. **Use Named Productions**: They make your intentions clearer than referencing raw tokens or entire rules.
3. **Think in Terms of Trees**: Remember that "higher precedence" (`after`) means the operation will appear **lower** in the resulting AST (it stays on the stack longer).
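
As an illustration of point 3: with `STAR after PLUS`, the input `2 + 3 * 4` reduces the multiplication first, so `Mul` sits lower in the tree. A plain-Scala sketch of the resulting AST (the `Expr` enum here is purely illustrative):

```scala
enum Expr:
  case Num(v: Int)
  case Add(l: Expr, r: Expr)
  case Mul(l: Expr, r: Expr)

import Expr.*

// Higher precedence (`after`) means deeper in the tree:
val tree = Add(Num(2), Mul(Num(3), Num(4)))  // not Mul(Add(Num(2), Num(3)), Num(4))
```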

File: docs/_docs/guides/contextual-parsing.md

# Guide: Contextual Parsing

Contextual parsing refers to the ability of a lexer or parser to change its behavior or maintain state based on the input it has already seen. Alpaca provides powerful mechanisms for this through `LexerCtx`, `ParserCtx`, and the `BetweenStages` hook.

## 1. Lexer-Level Context

Most contextual logic in Alpaca happens at the lexer level. Since the lexer tokenizes the entire input before the parser starts, the lexer context is the primary place to track state that affects tokenization.

### Pattern: Brace Matching & Nesting

You can use a context to track nesting levels of braces, parentheses, or brackets.

```scala
import alpaca.*
import scala.collection.mutable.Stack

case class BraceCtx(
  var text: CharSequence = "",
  val stack: Stack[String] = Stack()
) extends LexerCtx

val myLexer = lexer[BraceCtx]:
  case "\\(" =>
    ctx.stack.push("paren")
    Token["("]
  case "\\)" =>
    if ctx.stack.isEmpty || ctx.stack.pop() != "paren" then
      throw RuntimeException("Mismatched parenthesis")
    Token[")"]
```

### Pattern: Indentation-Based Parsing

For languages like Python or Scala (with `braceless` syntax), you can track indentation levels in the context and emit virtual `INDENT` and `OUTDENT` tokens (or just adjust state for later use).

```scala
case class IndentCtx(
  var text: CharSequence = "",
  var currentIndent: Int = 0,
  val indents: Stack[Int] = Stack(0)
) extends LexerCtx

val pythonicLexer = lexer[IndentCtx]:
  case x @ "\\n +" =>
    val newIndent = x.length - 1
    if newIndent > ctx.currentIndent then
      ctx.indents.push(newIndent)
      ctx.currentIndent = newIndent
      Token["INDENT"]
    else if newIndent < ctx.currentIndent then
      // ... logic to emit multiple OUTDENTs ...
      Token["OUTDENT"]
    else
      Token.Ignored
```
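
The elided OUTDENT logic amounts to popping every indentation level deeper than the new one. A standalone sketch of that bookkeeping (plain Scala, no Alpaca types involved):

```scala
import scala.collection.mutable.Stack

// Returns how many OUTDENT tokens should be emitted when the
// indentation drops to `newIndent`; pops the closed levels as it goes.
def outdentCount(indents: Stack[Int], newIndent: Int): Int =
  var count = 0
  while indents.nonEmpty && indents.top > newIndent do
    indents.pop()
    count += 1
  count
```

Since the lexer emits one token per match, closing several levels at once may require keeping a pending-OUTDENT counter in the context and draining it across subsequent matches.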

## 2. Accessing Lexer Context in the Parser

Every `Lexeme` matched by the lexer carries a "snapshot" of the `LexerCtx` as it was at the moment that specific token was matched.

This is extremely useful for error reporting or for logic that depends on when a token appeared.

```scala
object MyParser extends Parser:
  val root = rule:
    case MyLexer.ID(id) =>
      // id is a Lexeme; captured context fields are exposed as dynamic members,
      // e.g. a `line` field from your LexerCtx is available as `id.line`
      println(s"Matched ID at line ${id.line}")
      id.value
```

## 3. Parser-Level Context (`ParserCtx`)

`ParserCtx` is for maintaining state during the reduction process. This is where you build symbol tables, track variable declarations, or perform type checking.

```scala
case class MyParserCtx(var symbols: Map[String, Type] = Map()) extends ParserCtx

object MyParser extends Parser[MyParserCtx]:
  val root = rule:
    case Decl(d) => d

  val Decl = rule:
    case (MyLexer.VAR(_), MyLexer.ID(id), MyLexer.TYPE(t)) =>
      ctx.symbols += (id.value -> t.value)
      // ...
```
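
Putting it together, a hypothetical end-to-end run would thread the symbol table through the reductions (the exact return shape of `parse` and how the final context is exposed are assumptions here, not confirmed API):

```scala
val (lexCtx, lexemes) = MyLexer.tokenize("var x Int")
val result = MyParser.parse(lexemes)
// After parsing, the final MyParserCtx holds the accumulated symbol table
// (e.g. a mapping for "x"), retrievable however your Parser exposes its context.
```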

## 4. Mode Switching (Lexical Feedback)

Sometimes you need to change how the lexer behaves based on what it just matched. For example, when parsing a string with interpolation:
`"Hello ${user.name}!"`

While Alpaca doesn't support "real-time" feedback from the parser to the lexer (as the lexer finishes first), you can implement modes within the lexer using context state.

```scala
case class ModeCtx(var inString: Boolean = false, var text: CharSequence = "") extends LexerCtx

val modeLexer = lexer[ModeCtx]:
  case "\"" =>
    ctx.inString = !ctx.inString
    Token["QUOTE"]

  case "[a-z]+" if !ctx.inString => Token["KEYWORD"]
  case """[^"]+""" if ctx.inString => Token["STRING_CONTENT"]
```

## 5. The `BetweenStages` Hook

The `BetweenStages` hook is the internal engine that powers context updates. It is a function called by Alpaca after **every** token match (including `Token.Ignored`) but **before** the next match starts.

### Automatic Updates
By default, Alpaca uses `BetweenStages` to automatically update the `text` field in your context. If your context extends `LineTracking` or `PositionTracking`, it also increments `line` and `position` counters.
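
For example, a context that opts into both mixins might look like this (the field names follow the `line` and `position` members referenced elsewhere in these guides, but treat the exact signatures as assumptions):

```scala
// Assumed shape: mixing in the tracking traits lets BetweenStages
// maintain these counters automatically after every match.
case class TrackedCtx(
  var text: CharSequence = "",
  var line: Int = 1,
  var position: Int = 0
) extends LexerCtx, LineTracking, PositionTracking
```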

### Customizing `BetweenStages`
If you need complex logic to run after every match, you can provide a custom `given` instance of `BetweenStages`.

```scala
given MyBetweenStages: BetweenStages[MyCtx] with
  def apply(token: Token[?, MyCtx, ?], matcher: Matcher, ctx: MyCtx): Unit =
    // Custom global logic, run after every token match
    println(s"Just matched ${token.info.name}")
```

**Note**: `BetweenStages` is currently declared `private[alpaca]`, so downstream code cannot yet provide its own instance. Until the hook is exposed publicly, customization is limited to the built-in mechanisms, such as mixing in `LineTracking` or `PositionTracking`.

## Summary of Data Flow

1. **Input String** flows into the `lexer`.
2. **`BetweenStages`** updates the `LexerCtx` after every match.
3. **`Lexeme`s** are produced, each capturing the current `LexerCtx` state.
4. **List[Lexeme]** flows into the `parser`.
5. **`ParserCtx`** is initialized and updated as rules are reduced.
6. **Result** is produced, along with the final `ParserCtx`.

File: docs/_docs/guides/lexer-error-handling.md

# Guide: Lexer Error Handling

By default, Alpaca's lexer is strict: it throws a `RuntimeException` as soon as it encounters a character that doesn't match any of your defined token patterns.

However, for real-world applications, you often want the lexer to be more resilient, either by reporting multiple errors or by skipping invalid characters and continuing.
This guide explores strategies for implementing custom error handling in your lexer.

## 1. Default Behavior

When `tokenize` fails to find a match, it throws an exception:
```text
java.lang.RuntimeException: Unexpected character: '?'
```
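
If you keep the default behavior, wrap tokenization defensively:

```scala
// Minimal guard around the fail-fast default:
try
  val (ctx, lexemes) = myLexer.tokenize(input)
  // ... continue to parsing ...
catch
  case e: RuntimeException =>
    println(s"Lexing failed: ${e.getMessage}")
```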

## 2. Strategy: Catch-all Error Token

The most common strategy is to add a pattern at the **end** of your lexer that matches any single character.
This "catch-all" pattern will only be reached if no other token matches.

```scala
val myLexer = lexer:
  case "[0-9]+" => Token["NUM"]
  // ... other tokens ...
  case "\\s+" => Token.Ignored

  // Catch-all: matches any single character that wasn't matched above
  case "." => Token["ERROR"]
```

By emitting an `ERROR` token, the lexer can continue tokenizing the rest of the input.
Your parser can then decide how to handle these `ERROR` tokens.

## 3. Strategy: Error Counting in Context

You can use a custom `LexerCtx` to track the number of errors encountered during tokenization.

```scala
case class ErrorCtx(
  var text: CharSequence = "",
  var position: Int = 0,
  var errorCount: Int = 0
) extends LexerCtx, PositionTracking // PositionTracking makes ctx.position available

val myLexer = lexer[ErrorCtx]:
  case "[a-z]+" => Token["ID"]
  case "\\s+" => Token.Ignored

  case x @ "." =>
    ctx.errorCount += 1
    println(s"Error: Unexpected character '$x' at position ${ctx.position}")
    Token.Ignored // Skip the character
```
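
After tokenization, inspect the final context to decide whether to proceed:

```scala
// Sketch: the tokenize return shape matches the example in section 5 below.
val (finalCtx, lexemes) = myLexer.tokenize("abc ? def")
if finalCtx.errorCount > 0 then
  println(s"Tokenized with ${finalCtx.errorCount} error(s); aborting parse")
else
  myParser.parse(lexemes)
```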

## 4. Strategy: Error Recovery via Ignored Tokens

If you want to simply skip unexpected characters and proceed as if they weren't there, you can use `Token.Ignored` in your catch-all case.

```scala
val resilientLexer = lexer:
  case "[0-9]+" => Token["NUM"]
  case "\\s+" => Token.Ignored

  // Log and ignore
  case x @ "." =>
    reportError(x)
    Token.Ignored
```

## 5. Implementation Considerations

### Precedence
Always place your catch-all pattern at the **very bottom** of your `lexer` block. Since Alpaca tries to match patterns in order (or uses the longest match with precedence), putting a `.` at the top would match everything and shadow your other rules.

### Performance
A catch-all `.` pattern can slightly impact performance if your input contains many invalid sequences, as it will be matched character-by-character. However, for most use cases, the overhead is negligible.

### Parser Integration
If your lexer produces `ERROR` tokens, your parser rules should be prepared to handle them, or you should filter them out before passing the lexeme list to `parser.parse()`.

```scala
val (finalCtx, lexemes) = myLexer.tokenize(input)
val validLexemes = lexemes.filter(_.name != "ERROR")
val result = myParser.parse(validLexemes)
```

## Summary

- **Default**: Fails fast with an exception.
- **Catch-all**: Use `case "."` to capture invalid characters.
- **Resilience**: Use `LexerCtx` to track and report errors without stopping.
- **Emitters**: Choose between emitting an `ERROR` token or using `Token.Ignored` to skip.