group_by: partition_by does not support inline func expressions #1676

@dmpetrov

Description

When an inline func.* expression is passed directly to partition_by in group_by, DataChain silently derives an internal column name from it (file__path__parent in the example below) and then fails with a SignalResolvingError because that column does not exist in the schema yet.

Steps to reproduce

import datachain as dc
from datachain import C, func

(
    dc.read_storage("gs://datachain-demo/", anon=True)
    .group_by(
        count=func.count(),
        total_size=func.sum(C("file.size")),
        partition_by=func.path.parent(C("file.path")),  # inline expression
    )
    .show()
)

Error

datachain.lib.signal_schema.SignalResolvingError: cannot resolve signal name 'file__path__parent': is not found

Expected behavior

Either:

  • Inline expressions in partition_by are evaluated automatically (like they are in mutate), or
  • partition_by raises a clear error explaining that it requires a pre-existing column name, not an expression
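
To illustrate the second option, here is a minimal sketch of the kind of up-front check that could produce a clearer message. It is not DataChain code: PartitionByError, check_partition_by, and the known_columns argument are hypothetical names used only for illustration, and the real schema lookup would look different.

```python
class PartitionByError(TypeError):
    """Hypothetical error type for illustration; not part of DataChain."""


def check_partition_by(partition_by, known_columns):
    """Return partition_by if it names an existing column, else raise.

    known_columns is a hypothetical stand-in for the chain's schema.
    """
    if not isinstance(partition_by, str):
        # An inline expression (or any other non-string value) was passed.
        raise PartitionByError(
            "partition_by requires the name of an existing column, got "
            f"{type(partition_by).__name__}; create the column with "
            ".mutate() first"
        )
    if partition_by not in known_columns:
        raise PartitionByError(
            f"partition_by column {partition_by!r} is not in the schema"
        )
    return partition_by
```

With a check like this, the failure would point at the partition_by argument itself rather than at a derived name such as file__path__parent that the user never wrote.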

Workaround

Materialize the column with .mutate() first:

(
    dc.read_storage("gs://datachain-demo/", anon=True)
    .mutate(parent=func.path.parent(C("file.path")))
    .group_by(
        count=func.count(),
        total_size=func.sum(C("file.size")),
        partition_by="parent",
    )
    .show()
)
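
If this pattern comes up often, the workaround can be wrapped in a small helper. The sketch below is only a convenience around the two calls shown above (.mutate() followed by .group_by() with a string partition_by); group_by_expr and RecordingChain are hypothetical names, and RecordingChain is a stand-in that records calls so the snippet runs without DataChain installed.

```python
def group_by_expr(chain, key_name, key_expr, **aggregates):
    """Materialize key_expr as column key_name, then group by that column."""
    return chain.mutate(**{key_name: key_expr}).group_by(
        **aggregates, partition_by=key_name
    )


class RecordingChain:
    """Stand-in for a chain object; it only records the calls made on it."""

    def __init__(self, calls=None):
        self.calls = calls or []

    def mutate(self, **kwargs):
        return RecordingChain(self.calls + [("mutate", kwargs)])

    def group_by(self, partition_by=None, **kwargs):
        return RecordingChain(self.calls + [("group_by", partition_by, kwargs)])


# Placeholder strings are used here instead of real func.* expressions.
result = group_by_expr(RecordingChain(), "parent", "<expr>", count="<count>")
```

Against a real chain, the same helper would presumably be called with dc.read_storage(...) as the first argument, "parent" as key_name, func.path.parent(C("file.path")) as key_expr, and the count and total_size aggregates as keyword arguments.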
