Open
Description
When passing an inline func.* expression directly to partition_by in group_by, DataChain silently derives an internal column name and then fails with a SignalResolvingError because that column doesn't exist in the schema yet.
Steps to reproduce
import datachain as dc
from datachain import C, func
(
dc.read_storage("gs://datachain-demo/", anon=True)
.group_by(
count=func.count(),
total_size=func.sum(C("file.size")),
partition_by=func.path.parent(C("file.path")), # inline expression
)
.show()
)
Error
datachain.lib.signal_schema.SignalResolvingError: cannot resolve signal name 'file__path__parent': is not found
Expected behavior
Either:
- Inline expressions in partition_by are evaluated automatically (like they are in mutate), or
- A clear error message explaining that partition_by requires a pre-existing column name, not an expression
Workaround
Materialize the column with .mutate() first:
(
dc.read_storage("gs://datachain-demo/", anon=True)
.mutate(parent=func.path.parent(C("file.path")))
.group_by(
count=func.count(),
total_size=func.sum(C("file.size")),
partition_by="parent",
)
.show()
)
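If this pattern comes up often, the workaround can be wrapped in a small helper. The sketch below is only an illustration: group_by_expression is a hypothetical name, not part of DataChain, and it assumes nothing beyond what the workaround above already uses, namely that the chain exposes .mutate(**columns) and .group_by(partition_by=..., **aggregates).

```python
from typing import Any


def group_by_expression(chain: Any, name: str, expr: Any, **aggregates: Any) -> Any:
    """Materialize `expr` as column `name`, then group by that column.

    Hypothetical helper: `chain` is any object exposing .mutate(**columns)
    and .group_by(partition_by=..., **aggregates), as in the workaround.
    """
    # partition_by is supplied by this helper, so reject it early with a
    # readable message instead of a duplicate-keyword TypeError.
    if "partition_by" in aggregates:
        raise ValueError(
            "partition_by is set by this helper; pass `name` and `expr` instead"
        )
    # Step 1: create the column so it exists in the schema.
    # Step 2: group by the now-existing column name.
    return chain.mutate(**{name: expr}).group_by(partition_by=name, **aggregates)


# Assumed usage with the chain from the report (not executed here):
# group_by_expression(
#     dc.read_storage("gs://datachain-demo/", anon=True),
#     "parent",
#     func.path.parent(C("file.path")),
#     count=func.count(),
#     total_size=func.sum(C("file.size")),
# ).show()
```

The helper does not change DataChain's behavior; it only applies the mutate-then-group_by order in one place so the inline expression is never passed to partition_by directly.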