
Add literals support #169

Draft
halotukozak wants to merge 13 commits into master from literals

Conversation

@halotukozak
Owner

@halotukozak halotukozak commented Dec 30, 2025

AI-Generated Description

Add Literals Support

This pull request introduces support for literals, alongside significant refactoring to improve code clarity, maintainability, and performance. The changes streamline token handling, simplify debug utilities, and enhance context ownership, resulting in a more robust and efficient codebase.

Key Changes

  • Literals Support: Implemented support for literals in the lexer and parser components.
  • Refactored DebugSettings: Updated DebugSettings to use a parameterized type for improved type safety and flexibility.
  • Token Handling Improvements: Adjusted token handling and refined context ownership for better clarity and efficiency.
  • Simplified Transformations: Consolidated and optimized tree transformations, debug utilities, and token compilation routines.
  • Removed DebugPosition: Eliminated the DebugPosition opaque type to simplify and streamline debug utilities.

Technical Details

  • Updates to lexer and token-related files (e.g., Lexer.scala, Token.scala, Tokenization.scala) include enhanced handling of literals and improved token processing logic.
  • Debug utilities were simplified by removing redundant structures, such as the DebugPosition opaque type, and improving clarity in debug-related code paths.
  • Tests were updated to reflect the new behavior and ensure coverage for literals and refactored components. This includes changes to integration tests (e.g., CompilationTheoryTest, JsonTest) and unit tests (e.g., LexerTest, LexerApiTest).
  • Build file adjustments ensure compatibility with the new changes.

This PR results in 491 additions and 387 deletions across 22 files, reflecting both the addition of new functionality and the removal of redundant or outdated code.


Statistics

  • 22 files changed
  • 491 additions
  • 387 deletions

Copilot AI review requested due to automatic review settings December 30, 2025 20:55
Contributor

Copilot AI left a comment


Pull request overview

This PR adds support for character literals in lexer definitions, allowing developers to write case '+' => Token["PLUS"] instead of requiring regex strings like case "\\+" => Token["PLUS"]. This makes lexer definitions more concise and readable for single-character tokens.

Key changes:

  • Extended lexer pattern matching to accept both String and Char literals
  • Modified TokenInfo to store patterns as Seq[String | Char] instead of a single String
  • Added byLiteral map to Tokenization for efficient character-based token lookup
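The `byLiteral` idea in the last bullet can be sketched in a few lines: single-character tokens resolve through a direct map lookup instead of a regex scan. `Tok` below is an illustrative stand-in, not the library's Token type:

```scala
// Hypothetical token type for illustration only.
final case class Tok(name: String)

// Single-character tokens keyed by their literal character.
val byLiteral: Map[Char, Tok] =
  Map('+' -> Tok("PLUS"), '-' -> Tok("MINUS"), '(' -> Tok("LPAREN"))

// O(1) hash lookup; falls through (None) for anything that needs
// full regex matching.
def lookup(c: Char): Option[Tok] = byLiteral.get(c)
```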

Reviewed changes

Copilot reviewed 22 out of 22 changed files in this pull request and generated 7 comments.

Show a summary per file
File Description
test/src/alpaca/internal/parser/ParseTableTest.scala Updated test to use character literals ('+') instead of regex strings ("\\+")
test/src/alpaca/internal/lexer/RegexCheckerTest.scala Changed test assertions to match new return type (tuples instead of error strings)
test/src/alpaca/internal/lexer/LexerTest.scala Replaced regex patterns with character literals and shouldBe assertions
test/src/alpaca/integration/MathTest.scala Converted single-character operators to use character literal syntax
test/src/alpaca/integration/MainTest.scala Applied character literal syntax for operators and parentheses
test/src/alpaca/integration/JsonTest.scala Updated JSON lexer to use character literals for brackets and punctuation
test/src/alpaca/integration/CompilationTheoryTest.scala Modified compilation theory test lexer with character literals
test/src/alpaca/ParserApiTest.scala Updated parser test lexer to use character literals
test/src/alpaca/LexerApiTest.scala Comprehensive test demonstrating mixed regex and literal patterns (contains syntax errors)
test/src/alpaca/CtxRemappingTest.scala Updated context remapping tests with character literals
src/alpaca/lexer.scala Added LexerDefinition, TokenDefinition, and IgnoredTokenDefinition types supporting `String | Char` patterns
src/alpaca/internal/lexer/Tokenization.scala Added byLiteral map for character-based token lookup
src/alpaca/internal/lexer/Token.scala Extended TokenInfo to support `Seq[String | Char]` patterns
src/alpaca/internal/lexer/RegexChecker.scala Modified to return tuple indices instead of error messages, added checkInfos method
src/alpaca/internal/lexer/Lexer.scala Implemented byLiteral map generation and updated token compilation logic
src/alpaca/internal/lexer/CompileNameAndPattern.scala Added support for extracting character literals and union types
src/alpaca/internal/internal.scala Added changeOwner to lambda generation and moved ToExpr/FromExpr givens
src/alpaca/internal/debugUtils.scala Removed redundant string extension methods
src/alpaca/internal/ValidName.scala Added ValidNameLike type alias for `(Char
src/alpaca/internal/UnsafeExtensions.scala Enhanced error messages with source position information
src/alpaca/DebugSettings.scala Removed unused import
build.mill Updated Scala version to 3.8.0-RC3 and commented out -explain flag


def checkInfos(infos: Seq[TokenInfo[?]])(using quotes: Quotes): Unit = for
  patterns = infos.map(_.toEscapedRegex)
  (j, i) <- checkPatterns(patterns)
do quotes.reflect.report.error(s"Pattern ${infos(j).toRegex} is shadowed by ${infos(i).toRegex}")

Copilot AI Dec 30, 2025


The error message uses toRegex which doesn't escape regex special characters. This will produce confusing error messages showing raw characters like '+' or '(' instead of their escaped forms '\+' or '\('. Consider using a method that preserves the original pattern representation or creates a user-friendly display format that clearly shows what patterns are being matched.

Suggested change
do quotes.reflect.report.error(s"Pattern ${infos(j).toRegex} is shadowed by ${infos(i).toRegex}")
do quotes.reflect.report.error(s"Pattern ${infos(j).toEscapedRegex} is shadowed by ${infos(i).toEscapedRegex}")

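The escaping concern above comes down to regex metacharacters: `+`, `(`, `*`, etc. have operator meaning unless quoted. A minimal standalone demonstration using the standard library's `Regex.quote` (the same mechanism the snippet below uses for Char patterns):

```scala
import scala.util.matching.Regex

// '+' is a regex metacharacter; Regex.quote wraps it in \Q...\E so it
// matches only the literal plus sign.
val escaped = Regex.quote("+") // yields "\Q+\E"
assert("+".matches(escaped))
// By contrast, "+".matches("+") throws PatternSyntaxException,
// because a bare '+' is the "one or more" quantifier.
```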
val all = Expr.ofSeq {
(definedTokenVals ++ ignoredTokenVals).collect:
case valDef if valDef.name.length == 1 =>
Expr.ofTuple((Expr(valDef.name.head), Ref(valDef.symbol).asExprOf[DefinedToken[?, Ctx, ?]]))

Copilot AI Dec 30, 2025


The byLiteral map is being populated with both defined and ignored tokens, but at line 234, both are being cast to DefinedToken[?, Ctx, ?]. However, ignoredTokenVals contains IgnoredToken instances, not DefinedToken instances. This type mismatch will cause a compilation error or runtime type cast exception. The type should be Token[?, Ctx, ?] to accommodate both token types.

Suggested change
Expr.ofTuple((Expr(valDef.name.head), Ref(valDef.symbol).asExprOf[DefinedToken[?, Ctx, ?]]))
Expr.ofTuple((Expr(valDef.name.head), Ref(valDef.symbol).asExprOf[Token[?, Ctx, ?]]))

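The type-mismatch point can be illustrated without the macro machinery: a map that holds both kinds of token must be typed at their common supertype. The hierarchy below is an illustrative stand-in, not alpaca's actual types:

```scala
// Hypothetical sketch: DefinedToken and IgnoredToken share a Token parent.
sealed trait Token { def name: String }
final case class DefinedToken(name: String) extends Token
final case class IgnoredToken(name: String) extends Token

// Typing the values as DefinedToken would reject (or require an unsafe
// cast for) the ignored entries; the supertype accommodates both.
val byLiteral: Map[Char, Token] =
  Map('+' -> DefinedToken("PLUS"), ' ' -> IgnoredToken("WS"))
```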
.map:
case TokenInfo(_, regexGroupName, pattern) => s"(?<$regexGroupName>$pattern)"
.map: tokenInfo =>
val regex = tokenInfo.toEscapedRegex // todo: literals should be handled separately

Copilot AI Dec 30, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This TODO comment mentions that "literals should be handled separately", but the current implementation treats them the same as regex patterns by escaping them. Consider clarifying what the intended separate handling should be - for example, should literals be matched directly as strings before attempting regex matching for better performance, or is there another reason they should be separate?

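The `(?<$regexGroupName>$pattern)` wrapping in the snippet above relies on named capture groups: after combining all token patterns into one alternation, the group that matched non-null identifies the token. A standalone sketch of that technique (group names here are illustrative, not alpaca's):

```scala
import scala.util.matching.Regex

// Two token patterns wrapped in named groups and joined by '|'.
val tokenRegex = new Regex("(?<num>[0-9]+)|(?<id>[a-zA-Z_][a-zA-Z0-9_]*)")

// The first group with a non-null capture tells us which pattern fired.
def classify(s: String): Option[String] =
  tokenRegex.findFirstMatchIn(s).flatMap { m =>
    Seq("num", "id").find(n => m.group(n) != null)
  }
```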
@@ -68,23 +68,18 @@ private[internal] final class CreateLambda[Q <: Quotes](using val quotes: Q) {
Lambda(
Symbol.spliceOwner,
MethodType(params.zipWithIndex.map((_, i) => s"$$arg$i"))(_ => params, _ => r),

Copilot AI Dec 30, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The changeOwner(sym) call is added to the result of rhsFn.unsafeApply. While this may be necessary for correct symbol ownership in the macro-generated code, ensure this doesn't break existing functionality where the owner was previously not changed. If this fixes a bug, consider documenting why this change is needed.

Suggested change
MethodType(params.zipWithIndex.map((_, i) => s"$$arg$i"))(_ => params, _ => r),
MethodType(params.zipWithIndex.map((_, i) => s"$$arg$i"))(_ => params, _ => r),
// Re-own the body tree under the synthetic method symbol `sym` created by `Lambda`.
// `rhsFn` typically builds its tree under `Symbol.spliceOwner`; without `changeOwner(sym)`
// the resulting lambda body would have an incorrect owner and could produce invalid trees.

case x: "[a-zA-Z_][a-zA-Z0-9_]*" => Token["id"](x)
case lit: ('<' | '>' | '=' | '+' | '-' | '*' | '/' | '(' | ')' | '[' | ']' | '{' | '}' | ':' | '\\' | ',' | ';') =>
Token[lit.type]
// }

Copilot AI Dec 30, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The closing brace of the lexer definition is commented out. This will cause the test file to fail to compile. The comment marker should be removed to properly close the lexer definition.

Suggested change
// }
}

"print") =>
Token[keyword.type]
case x: "[a-zA-Z_][a-zA-Z0-9_]*" => Token["id"](x)
case lit: ('<' | '>' | '=' | '+' | '-' | '*' | '/' | '(' | ')' | '[' | ']' | '{' | '}' | ':' | '\\' | ',' | ';') =>

Copilot AI Dec 30, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The character literal '\' (backslash) in this union type is incorrect. A single backslash character literal should be written as '\\' (escaped backslash) in Scala. The current syntax '\' would be interpreted as an escape sequence rather than a literal backslash character.

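The union-of-character-literals pattern used in the test above is plain Scala 3: each `Char` literal is a singleton type, and a typed pattern against their union both tests membership and narrows the binding. A minimal sketch, independent of the alpaca API:

```scala
// Scala 3 union of Char singleton types; `Op` names are illustrative.
type Op = '+' | '-' | '*' | '/'

def describe(c: Char): String = c match
  case op: Op => s"operator $op"
  case other  => s"non-operator $other"

// As the review comment notes, a literal backslash in such a union must be
// spelled '\\', since a bare '\' starts an escape sequence.
```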
case c: Char => Regex.quote(c.toString)
.mkString("|")

lazy val toRegex: String = patterns.mkString("|")

Copilot AI Dec 30, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The toRegex method at line 41 converts patterns to a regex string by calling mkString("|") directly on the sequence. For Char elements, this will only include the character itself without proper regex escaping. This can cause issues with special regex characters like '.', '*', '+', etc. Consider using toEscapedRegex for all regex construction or ensuring characters are properly escaped when used in regex contexts.

Suggested change
lazy val toRegex: String = patterns.mkString("|")
lazy val toRegex: String = toEscapedRegex

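The escaping split the diff hunk above hints at can be sketched as a standalone function: String patterns pass through as regexes while Char literals are quoted before joining the alternation. This mirrors the `case c: Char => Regex.quote(c.toString)` line but is not the library's actual implementation:

```scala
import scala.util.matching.Regex

// Sketch: strings are assumed to already be valid regexes; chars are
// quoted so '+' or '(' cannot be misread as regex operators.
def toEscapedRegex(patterns: Seq[String | Char]): String =
  patterns.map {
    case s: String => s
    case c: Char   => Regex.quote(c.toString)
  }.mkString("|")

// e.g. toEscapedRegex(Seq("[0-9]+", '+')) yields "[0-9]+|\Q+\E"
```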
# Conflicts:
#	build.mill
#	src/alpaca/DebugSettings.scala
#	src/alpaca/internal/DebugSettings.scala
#	src/alpaca/internal/Showable.scala
#	src/alpaca/internal/UnsafeExtensions.scala
#	src/alpaca/internal/ValidName.scala
#	src/alpaca/internal/debugUtils.scala
#	src/alpaca/internal/internal.scala
#	src/alpaca/internal/lexer/CompileNameAndPattern.scala
#	src/alpaca/internal/lexer/Lexer.scala
#	src/alpaca/internal/lexer/RegexChecker.scala
#	src/alpaca/internal/lexer/Token.scala
#	src/alpaca/internal/lexer/Tokenization.scala
#	src/alpaca/internal/parser/Parser.scala
#	src/alpaca/internal/parser/createTables.scala
#	src/alpaca/lexer.scala
#	test/src/alpaca/CtxRemappingTest.scala
#	test/src/alpaca/ParserApiTest.scala
#	test/src/alpaca/integration/CompilationTheoryTest.scala
#	test/src/alpaca/integration/JsonTest.scala
#	test/src/alpaca/integration/MainTest.scala
#	test/src/alpaca/integration/MathTest.scala
#	test/src/alpaca/internal/parser/ParseTableTest.scala
# Conflicts:
#	src/alpaca/internal/internal.scala
#	src/alpaca/internal/lexer/Lexer.scala
…sing `treeToStr`, improve readability, and remove redundant `.unsafeMap` calls.
@halotukozak halotukozak marked this pull request as draft January 1, 2026 20:38
# Conflicts:
#	src/alpaca/internal/lexer/Lexer.scala
@halotukozak halotukozak changed the base branch from master to token-definition January 1, 2026 20:43
# Conflicts:
#	src/alpaca/internal/lexer/Lexer.scala
#	src/alpaca/lexer.scala
#	test/src/alpaca/LexerApiTest.scala
Base automatically changed from token-definition to master January 7, 2026 19:04