feat: Add support for nw.int_range for eager backends#2895
feat: Add support for nw.int_range for eager backends#2895
nw.int_range for eager backends#2895Conversation
|
Hey @FBruzzesi, as I've mentioned 8347839 times, I'm very keen to see this in Would you be okay if we let this one marinate for a bit and maybe target the release after the next? 🙏 |
🎉
Yeah I was not expecting this to be finalized by tomorrow. It has some sharp edges that need some love and work |
|
@MarcoGorelli any "trick" to avoid the custom check in precommit?
I thought about emulating that |
You could add this to IntegerDType: TypeAlias = "dtypes.IntegerType | type[dtypes.IntegerType]" |
#2982 removed from everywhere else
> TypeError: argument 'step': 'PolarsExpr' object cannot be interpreted as an integer https://github.com/narwhals-dev/narwhals/actions/runs/17499840563/job/49709898155?pr=2895
@MarcoGorelli pinging again as (#2895 (comment)) was a month ago 🫣 |
|
@FBruzzesi I still have hope that eventually we'll land this 🙏 I have an idea of how we could reuse IdeaAs an example, if we convert the output of import polars as pl
import pyarrow as pa
start, end = dt.date(2000, 1, 1), dt.date(2000, 1, 5)
expr = pl.date_range(start, end).alias("date")
table = pl.select(expr).to_arrow()
table.column("date").typeSee Since that is just represented as a 32-bit integer, we can do things like: dates = table.column("date")
datesShow Output
dates.cast(pa.int32())Show Output
dates.cast(pa.int32()).cast(pa.date32())Show Output
In practiceimport datetime as dt
import pyarrow as pa
def date_range(
start: dt.date,
end: dt.date,
interval: int, # (* assuming the `Interval` part is solved)
*,
closed: ClosedInterval = "both",
) -> pa.Date32Array:
start_i = pa.scalar(start).cast(pa.int32()).as_py()
end_i = pa.scalar(end).cast(pa.int32()).as_py()
# call `int_range` here for the compatibility branch
arr = pa.arange(start_i, end_i + 1, interval)
if closed != "both":
if closed == "left":
arr = arr.slice(length=len(arr) - 1)
elif closed == "none":
arr = arr.slice(1, len(arr) - 1)
else:
arr = arr.slice(1)
# the first cast would happen in `int_range(dtype=...)`
return arr.cast(pa.int32()).cast(pa.date32())
start, end = dt.date(2000, 1, 1), dt.date(2000, 2, 1)
date_range(start, end, interval=7, closed="none")
<pyarrow.lib.Date32Array object at 0x00000296A74176A0>
[
2000-01-08,
2000-01-15,
2000-01-22,
2000-01-29
]What's the catch?We'd need to adapt It would be restricted to only |
| @unstable | ||
| def int_range( | ||
| start: int | Expr, | ||
| end: int | Expr | None = None, | ||
| step: int = 1, | ||
| *, | ||
| dtype: IntegerDType = Int64, | ||
| eager: IntoBackend[EagerAllowed] | Literal[False] = False, | ||
| ) -> Expr | Series[Any]: |
There was a problem hiding this comment.
Note
I don't think this blocks anything - just a realization from me 🙂
I hadn't been able to put my finger on what seemed off to me wrt this until just now:
Polars can do
eager: bool. In our case it's a bit more complex than that.
Instead of adding yet another argument to the function, I allowed foreagerto be False (default), None or the backend/implementation that should back the series.
This trick would work for us in all the cases where eager defines Expr | Series.
That is most of them, but there is polars.select as the ugly duckling with:
DataFrame | LazyFrame
@overload
def select(
*exprs: IntoExpr | Iterable[IntoExpr],
eager: Literal[True] = ...,
**named_exprs: IntoExpr,
) -> DataFrame: ...
@overload
def select(
*exprs: IntoExpr | Iterable[IntoExpr],
eager: Literal[False],
**named_exprs: IntoExpr,
) -> LazyFrame: ...
def select(
*exprs: IntoExpr | Iterable[IntoExpr], eager: bool = True, **named_exprs: IntoExpr
) -> DataFrame | LazyFrame:If we added select, then we'd need two arguments to be able to say whether we want pl.DataFrame or pl.LazyFrame
def select(
*exprs: IntoExpr | Iterable[IntoExpr],
backend: IntoBackend[Backend],
eager: bool = True,
**named_exprs: IntoExpr,
) -> DataFrame | LazyFrame: ...But that also seems a bit footgun-y, since something like select(..., backend="duckdb") would have a conflicting default.
Mostly just re-purposing `int_range` *so far*, as mentioned in (#2895 (comment)) Need to think about `Interval` some more
Mostly just re-purposing `int_range` *so far*, as mentioned in (#2895 (comment)) Need to think about `Interval` some more
May need tweaking pending (#2895 (comment))
What type of PR is this? (check all applicable)
Checklist
If you have comments or can explain your changes, please do so below
Related issue: #2722
I am slightly concerned that this could turn out to be a month long PR. I will open it as draft for now. There are a couple of pain points I already know:
I was very careful with type hints, yet it somehow doesn't pass type checkerFixedeager: bool. In our case it's a bit more complex than that. Instead of adding yet another argument to the function, I allowed foreagerto be False (default), None or the backend/implementation that should back the series.