
Update buscar's pipelines using the Lazy Dataframe object #43

@axiomcura

Description


Would it make sense to try a lazy interpretation of this?

    import polars as pl

    # Lazily aggregate ratio-weighted on/off scores per treatment,
    # then sort and materialize the result with a single collect().
    (
        scores_df.lazy()
        .group_by("treatment")
        .agg(
            (
                (pl.col("on_score") * pl.col("ratio")).sum()
                + (pl.col("off_score") * pl.col("ratio")).sum()
            ).alias("compound_score")
        )
        .sort("compound_score")
        .collect()
    )

Originally posted by @d33bs in #42 (comment)

Eager DataFrame pipelines in Polars can consume large amounts of memory because each step materializes its intermediate result, especially when chaining transformations such as sorts, joins, and group-bys. In multi-step data-wrangling code this leads to memory pressure and reduced performance. Using the lazy API instead lets Polars optimize and fuse operations, push down filters, and avoid unnecessary materialization, improving both speed and memory efficiency. Eager execution remains appropriate for small, one-off operations, but for large datasets or multi-step pipelines a practical pattern is to have functions accept and return LazyFrame objects and invoke .collect() only at the pipeline boundary, as sketched below.
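
Below is a minimal sketch of that pattern, assuming a Parquet input and the column names from the snippet above; the function names (add_compound_score, rank_by_compound_score) and the scores.parquet path are hypothetical, not part of the current pipeline.

    import polars as pl

    def add_compound_score(lf: pl.LazyFrame) -> pl.LazyFrame:
        """Aggregate a ratio-weighted compound score per treatment; stays lazy."""
        return lf.group_by("treatment").agg(
            (
                (pl.col("on_score") * pl.col("ratio")).sum()
                + (pl.col("off_score") * pl.col("ratio")).sum()
            ).alias("compound_score")
        )

    def rank_by_compound_score(lf: pl.LazyFrame) -> pl.LazyFrame:
        """Order treatments by compound score; still lazy."""
        return lf.sort("compound_score")

    # Pipeline boundary: the query plan is optimized and materialized once here.
    scores_lf = pl.scan_parquet("scores.parquet")  # hypothetical input file
    ranked = rank_by_compound_score(add_compound_score(scores_lf)).collect()

Because both helpers accept and return LazyFrame, they compose freely and Polars only builds and optimizes the full query plan when .collect() is called at the end.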

A really good explanation of LazyFrames in Polars:
https://stackoverflow.com/a/76612637

Labels: enhancement (New feature or request)
