polarstation

Tidy helper functions for Polars, inspired by the R tidyverse.

Installation

pip install polarstation

or with uv:

uv add polarstation

Quick start

import polars as pl
import polarstation   # registers extension functions for polars

df = pl.DataFrame({
    "animal": ["dog", "dog", None, "bird", "cow", "bird", "bird"],
    "weight": [12.2, 8.1, 7.5, 0.5, 460, 0.4, None],
}).ps.with_columns(
    pl.col("animal").ps_enum.make().ps_enum.reorder(by="weight")
)
)
print(df)
print(df['animal'].dtype)

shape: (7, 2)
┌────────┬────────┐
│ animal ┆ weight │
│ ---    ┆ ---    │
│ enum   ┆ f64    │
╞════════╪════════╡
│ dog    ┆ 12.2   │
│ dog    ┆ 8.1    │
│ null   ┆ 7.5    │
│ bird   ┆ 0.5    │
│ cow    ┆ 460.0  │
│ bird   ┆ 0.4    │
│ bird   ┆ null   │
└────────┴────────┘
Enum(categories=['bird', 'dog', 'cow'])

ps.with_columns is a drop-in replacement for Polars' with_columns that also accepts expressions which must peek at the data before they can be evaluated, such as the Enum reordering above. It works efficiently on both DataFrame and LazyFrame.

Details

The key idea is FrameExpr — an expression that needs a peek at the data (schema or a small aggregation) before it resolves into a regular Polars expression. This unlocks operations like deriving Enum categories from the data, lumping rare levels, or reordering factor levels by a summary statistic, while keeping the rest of your pipeline lazy.

How FrameExpr stays efficient

ps.with_columns resolves each FrameExpr in two phases. First it runs a small aggregation (e.g. unique().sort() to discover categories) against the current lazy plan — so any preceding .filter() or .select() is already embedded and Polars’ predicate/projection pushdown keeps the peek cheap. Then it uses the result to build a concrete pl.Expr (e.g. .cast(pl.Enum(["a", "b", "c"]))) that goes back into the lazy plan and executes normally.

# Only the filtered rows are scanned for category discovery;
# the cast itself remains lazy.
lf = pl.scan_parquet("events.parquet")
result = (
    lf.filter(pl.col("country") == "DE")
      .ps.with_columns(pl.col("status").ps_enum.make())
      .filter(pl.col("status") == "active")
      .collect()
)

See the FrameExpr docstring for the full explanation, including when the peek is larger and notes on parallel evaluation.

Dev Notes

To build the documentation run:

uv run quarto render

and then, in a separate terminal:

uv run quarto preview

To re-render README.md, run:

quarto render README.qmd --to gfm

To upload to PyPI, run:

uv build
uv publish

Acknowledgements

This package stands on the shoulders of several excellent projects:

  • The tidyverse team for establishing the tidy data philosophy and the vocabulary that shapes this package’s design.
  • Hadley Wickham and the forcats authors for the factor-manipulation functions that directly inspired the ps_enum namespace.
  • David Hugh-Jones for santoku, which inspired the ps_chop functions.
  • Allison Horst, Alison Hill, and Kristen Gorman for the palmerpenguins dataset used in the examples and walkthrough.

License

MIT