
Add literals support #169

Draft
halotukozak wants to merge 13 commits into master from literals

Conversation

@halotukozak
Owner

@halotukozak halotukozak commented Dec 30, 2025

AI-Generated Description

Add Literals Support

This pull request introduces support for literals, alongside significant refactoring to improve code clarity, maintainability, and performance. The changes streamline token handling, simplify debug utilities, and enhance context ownership, resulting in a more robust and efficient codebase.

Key Changes

  • Literals Support: Implemented support for literals in the lexer and parser components.
  • Refactored DebugSettings: Updated DebugSettings to use a parameterized type for improved type safety and flexibility.
  • Token Handling Improvements: Adjusted token handling and refined context ownership for better clarity and efficiency.
  • Simplified Transformations: Consolidated and optimized tree transformations, debug utilities, and token compilation routines.
  • Removed DebugPosition: Eliminated the DebugPosition opaque type to simplify and streamline debug utilities.

Technical Details

  • Updates to lexer and token-related files (e.g., Lexer.scala, Token.scala, Tokenization.scala) include enhanced handling of literals and improved token processing logic.
  • Debug utilities were simplified by removing redundant structures, such as the DebugPosition opaque type, and improving clarity in debug-related code paths.
  • Tests were updated to reflect the new behavior and ensure coverage for literals and refactored components. This includes changes to integration tests (e.g., CompilationTheoryTest, JsonTest) and unit tests (e.g., LexerTest, LexerApiTest).
  • Build file adjustments ensure compatibility with the new changes.

This PR results in 491 additions and 387 deletions across 22 files, reflecting both the addition of new functionality and the removal of redundant or outdated code.


Statistics

  • 22 files changed
  • 491 additions
  • 387 deletions

Copilot AI review requested due to automatic review settings December 30, 2025 20:55
Contributor

Copilot AI left a comment


Pull request overview

This PR adds support for character literals in lexer definitions, allowing developers to write case '+' => Token["PLUS"] instead of requiring regex strings like case "\\+" => Token["PLUS"]. This makes lexer definitions more concise and readable for single-character tokens.

Key changes:

  • Extended lexer pattern matching to accept both String and Char literals
  • Modified TokenInfo to store patterns as Seq[String | Char] instead of a single String
  • Added byLiteral map to Tokenization for efficient character-based token lookup
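The `byLiteral` idea in the last bullet can be sketched in a few lines: single-character tokens resolve through a direct map lookup instead of a regex scan. `Tok` below is an illustrative stand-in, not the library's Token type:

```scala
// Hypothetical token type for illustration only.
final case class Tok(name: String)

// Single-character tokens keyed by their literal character.
val byLiteral: Map[Char, Tok] =
  Map('+' -> Tok("PLUS"), '-' -> Tok("MINUS"), '(' -> Tok("LPAREN"))

// O(1) hash lookup; falls through (None) for anything that needs
// full regex matching.
def lookup(c: Char): Option[Tok] = byLiteral.get(c)
```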

Reviewed changes

Copilot reviewed 22 out of 22 changed files in this pull request and generated 7 comments.

Show a summary per file
File Description
test/src/alpaca/internal/parser/ParseTableTest.scala Updated test to use character literals ('+') instead of regex strings ("\\+")
test/src/alpaca/internal/lexer/RegexCheckerTest.scala Changed test assertions to match new return type (tuples instead of error strings)
test/src/alpaca/internal/lexer/LexerTest.scala Replaced regex patterns with character literals and shouldBe assertions
test/src/alpaca/integration/MathTest.scala Converted single-character operators to use character literal syntax
test/src/alpaca/integration/MainTest.scala Applied character literal syntax for operators and parentheses
test/src/alpaca/integration/JsonTest.scala Updated JSON lexer to use character literals for brackets and punctuation
test/src/alpaca/integration/CompilationTheoryTest.scala Modified compilation theory test lexer with character literals
test/src/alpaca/ParserApiTest.scala Updated parser test lexer to use character literals
test/src/alpaca/LexerApiTest.scala Comprehensive test demonstrating mixed regex and literal patterns (contains syntax errors)
test/src/alpaca/CtxRemappingTest.scala Updated context remapping tests with character literals
src/alpaca/lexer.scala Added LexerDefinition, TokenDefinition, and IgnoredTokenDefinition types supporting `String | Char` patterns
src/alpaca/internal/lexer/Tokenization.scala Added byLiteral map for character-based token lookup
src/alpaca/internal/lexer/Token.scala Extended TokenInfo to support `Seq[String | Char]` patterns
src/alpaca/internal/lexer/RegexChecker.scala Modified to return tuple indices instead of error messages, added checkInfos method
src/alpaca/internal/lexer/Lexer.scala Implemented byLiteral map generation and updated token compilation logic
src/alpaca/internal/lexer/CompileNameAndPattern.scala Added support for extracting character literals and union types
src/alpaca/internal/internal.scala Added changeOwner to lambda generation and moved ToExpr/FromExpr givens
src/alpaca/internal/debugUtils.scala Removed redundant string extension methods
src/alpaca/internal/ValidName.scala Added ValidNameLike type alias for `(Char
src/alpaca/internal/UnsafeExtensions.scala Enhanced error messages with source position information
src/alpaca/DebugSettings.scala Removed unused import
build.mill Updated Scala version to 3.8.0-RC3 and commented out -explain flag


def checkInfos(infos: Seq[TokenInfo[?]])(using quotes: Quotes): Unit = for
  patterns = infos.map(_.toEscapedRegex)
  (j, i) <- checkPatterns(patterns)
do quotes.reflect.report.error(s"Pattern ${infos(j).toRegex} is shadowed by ${infos(i).toRegex}")

Copilot AI Dec 30, 2025


The error message uses toRegex which doesn't escape regex special characters. This will produce confusing error messages showing raw characters like '+' or '(' instead of their escaped forms '\+' or '\('. Consider using a method that preserves the original pattern representation or creates a user-friendly display format that clearly shows what patterns are being matched.

Suggested change
do quotes.reflect.report.error(s"Pattern ${infos(j).toRegex} is shadowed by ${infos(i).toRegex}")
do quotes.reflect.report.error(s"Pattern ${infos(j).toEscapedRegex} is shadowed by ${infos(i).toEscapedRegex}")

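The escaping concern above comes down to regex metacharacters: `+`, `(`, `*`, etc. have operator meaning unless quoted. A minimal standalone demonstration using the standard library's `Regex.quote` (the same mechanism the snippet below uses for Char patterns):

```scala
import scala.util.matching.Regex

// '+' is a regex metacharacter; Regex.quote wraps it in \Q...\E so it
// matches only the literal plus sign.
val escaped = Regex.quote("+") // yields "\Q+\E"
assert("+".matches(escaped))
// By contrast, "+".matches("+") throws PatternSyntaxException,
// because a bare '+' is the "one or more" quantifier.
```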
val all = Expr.ofSeq {
(definedTokenVals ++ ignoredTokenVals).collect:
case valDef if valDef.name.length == 1 =>
Expr.ofTuple((Expr(valDef.name.head), Ref(valDef.symbol).asExprOf[DefinedToken[?, Ctx, ?]]))

Copilot AI Dec 30, 2025


The byLiteral map is being populated with both defined and ignored tokens, but at line 234, both are being cast to DefinedToken[?, Ctx, ?]. However, ignoredTokenVals contains IgnoredToken instances, not DefinedToken instances. This type mismatch will cause a compilation error or runtime type cast exception. The type should be Token[?, Ctx, ?] to accommodate both token types.

Suggested change
Expr.ofTuple((Expr(valDef.name.head), Ref(valDef.symbol).asExprOf[DefinedToken[?, Ctx, ?]]))
Expr.ofTuple((Expr(valDef.name.head), Ref(valDef.symbol).asExprOf[Token[?, Ctx, ?]]))

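The type-mismatch point can be illustrated without the macro machinery: a map that holds both kinds of token must be typed at their common supertype. The hierarchy below is an illustrative stand-in, not alpaca's actual types:

```scala
// Hypothetical sketch: DefinedToken and IgnoredToken share a Token parent.
sealed trait Token { def name: String }
final case class DefinedToken(name: String) extends Token
final case class IgnoredToken(name: String) extends Token

// Typing the values as DefinedToken would reject (or require an unsafe
// cast for) the ignored entries; the supertype accommodates both.
val byLiteral: Map[Char, Token] =
  Map('+' -> DefinedToken("PLUS"), ' ' -> IgnoredToken("WS"))
```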
.map:
case TokenInfo(_, regexGroupName, pattern) => s"(?<$regexGroupName>$pattern)"
.map: tokenInfo =>
val regex = tokenInfo.toEscapedRegex // todo: literals should be handled separately

Copilot AI Dec 30, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This TODO comment mentions that "literals should be handled separately", but the current implementation treats them the same as regex patterns by escaping them. Consider clarifying what the intended separate handling should be - for example, should literals be matched directly as strings before attempting regex matching for better performance, or is there another reason they should be separate?

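The `(?<$regexGroupName>$pattern)` wrapping in the snippet above relies on named capture groups: after combining all token patterns into one alternation, the group that matched non-null identifies the token. A standalone sketch of that technique (group names here are illustrative, not alpaca's):

```scala
import scala.util.matching.Regex

// Two token patterns wrapped in named groups and joined by '|'.
val tokenRegex = new Regex("(?<num>[0-9]+)|(?<id>[a-zA-Z_][a-zA-Z0-9_]*)")

// The first group with a non-null capture tells us which pattern fired.
def classify(s: String): Option[String] =
  tokenRegex.findFirstMatchIn(s).flatMap { m =>
    Seq("num", "id").find(n => m.group(n) != null)
  }
```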
@@ -68,23 +68,18 @@ private[internal] final class CreateLambda[Q <: Quotes](using val quotes: Q) {
Lambda(
Symbol.spliceOwner,
MethodType(params.zipWithIndex.map((_, i) => s"$$arg$i"))(_ => params, _ => r),

Copilot AI Dec 30, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The changeOwner(sym) call is added to the result of rhsFn.unsafeApply. While this may be necessary for correct symbol ownership in the macro-generated code, ensure this doesn't break existing functionality where the owner was previously not changed. If this fixes a bug, consider documenting why this change is needed.

Suggested change
MethodType(params.zipWithIndex.map((_, i) => s"$$arg$i"))(_ => params, _ => r),
MethodType(params.zipWithIndex.map((_, i) => s"$$arg$i"))(_ => params, _ => r),
// Re-own the body tree under the synthetic method symbol `sym` created by `Lambda`.
// `rhsFn` typically builds its tree under `Symbol.spliceOwner`; without `changeOwner(sym)`
// the resulting lambda body would have an incorrect owner and could produce invalid trees.

case x: "[a-zA-Z_][a-zA-Z0-9_]*" => Token["id"](x)
case lit: ('<' | '>' | '=' | '+' | '-' | '*' | '/' | '(' | ')' | '[' | ']' | '{' | '}' | ':' | '\\' | ',' | ';') =>
Token[lit.type]
// }

Copilot AI Dec 30, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The closing brace of the lexer definition is commented out. This will cause the test file to fail to compile. The comment marker should be removed to properly close the lexer definition.

Suggested change
// }
}

"print") =>
Token[keyword.type]
case x: "[a-zA-Z_][a-zA-Z0-9_]*" => Token["id"](x)
case lit: ('<' | '>' | '=' | '+' | '-' | '*' | '/' | '(' | ')' | '[' | ']' | '{' | '}' | ':' | '\\' | ',' | ';') =>

Copilot AI Dec 30, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The character literal '\' (backslash) in this union type is incorrect. A single backslash character literal should be written as '\\' (escaped backslash) in Scala. The current syntax '\' would be interpreted as an escape sequence rather than a literal backslash character.

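The union-of-character-literals pattern used in the test above is plain Scala 3: each `Char` literal is a singleton type, and a typed pattern against their union both tests membership and narrows the binding. A minimal sketch, independent of the alpaca API:

```scala
// Scala 3 union of Char singleton types; `Op` names are illustrative.
type Op = '+' | '-' | '*' | '/'

def describe(c: Char): String = c match
  case op: Op => s"operator $op"
  case other  => s"non-operator $other"

// As the review comment notes, a literal backslash in such a union must be
// spelled '\\', since a bare '\' starts an escape sequence.
```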
case c: Char => Regex.quote(c.toString)
.mkString("|")

lazy val toRegex: String = patterns.mkString("|")

Copilot AI Dec 30, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The toRegex method at line 41 converts patterns to a regex string by calling mkString("|") directly on the sequence. For Char elements, this will only include the character itself without proper regex escaping. This can cause issues with special regex characters like '.', '*', '+', etc. Consider using toEscapedRegex for all regex construction or ensuring characters are properly escaped when used in regex contexts.

Suggested change
lazy val toRegex: String = patterns.mkString("|")
lazy val toRegex: String = toEscapedRegex

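The escaping split the diff hunk above hints at can be sketched as a standalone function: String patterns pass through as regexes while Char literals are quoted before joining the alternation. This mirrors the `case c: Char => Regex.quote(c.toString)` line but is not the library's actual implementation:

```scala
import scala.util.matching.Regex

// Sketch: strings are assumed to already be valid regexes; chars are
// quoted so '+' or '(' cannot be misread as regex operators.
def toEscapedRegex(patterns: Seq[String | Char]): String =
  patterns.map {
    case s: String => s
    case c: Char   => Regex.quote(c.toString)
  }.mkString("|")

// e.g. toEscapedRegex(Seq("[0-9]+", '+')) yields "[0-9]+|\Q+\E"
```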
# Conflicts:
#	build.mill
#	src/alpaca/DebugSettings.scala
#	src/alpaca/internal/DebugSettings.scala
#	src/alpaca/internal/Showable.scala
#	src/alpaca/internal/UnsafeExtensions.scala
#	src/alpaca/internal/ValidName.scala
#	src/alpaca/internal/debugUtils.scala
#	src/alpaca/internal/internal.scala
#	src/alpaca/internal/lexer/CompileNameAndPattern.scala
#	src/alpaca/internal/lexer/Lexer.scala
#	src/alpaca/internal/lexer/RegexChecker.scala
#	src/alpaca/internal/lexer/Token.scala
#	src/alpaca/internal/lexer/Tokenization.scala
#	src/alpaca/internal/parser/Parser.scala
#	src/alpaca/internal/parser/createTables.scala
#	src/alpaca/lexer.scala
#	test/src/alpaca/CtxRemappingTest.scala
#	test/src/alpaca/ParserApiTest.scala
#	test/src/alpaca/integration/CompilationTheoryTest.scala
#	test/src/alpaca/integration/JsonTest.scala
#	test/src/alpaca/integration/MainTest.scala
#	test/src/alpaca/integration/MathTest.scala
#	test/src/alpaca/internal/parser/ParseTableTest.scala
# Conflicts:
#	src/alpaca/internal/internal.scala
#	src/alpaca/internal/lexer/Lexer.scala
…sing `treeToStr`, improve readability, and remove redundant `.unsafeMap` calls.
@halotukozak halotukozak marked this pull request as draft January 1, 2026 20:38
# Conflicts:
#	src/alpaca/internal/lexer/Lexer.scala
@halotukozak halotukozak changed the base branch from master to token-definition January 1, 2026 20:43
# Conflicts:
#	src/alpaca/internal/lexer/Lexer.scala
#	src/alpaca/lexer.scala
#	test/src/alpaca/LexerApiTest.scala
Base automatically changed from token-definition to master January 7, 2026 19:04