polarstation

Tidy helper functions for Polars, inspired by the R tidyverse.

Installation

pip install polarstation

or with uv:

uv add polarstation

Quick start

import polars as pl
import polarstation   # registers extension functions for polars

df = pl.DataFrame({
    "animal": ["dog", "dog", None, "bird", "cow", "bird", "bird"],
    "weight": [12.2, 8.1, 7.5, 0.5, 460, 0.4, None],
}).ps.with_columns(
    pl.col("animal").ps_enum.make().ps_enum.reorder(by='weight')
)
print(df)
print(df['animal'].dtype)

shape: (7, 2)
┌────────┬────────┐
│ animal ┆ weight │
│ ---    ┆ ---    │
│ enum   ┆ f64    │
╞════════╪════════╡
│ dog    ┆ 12.2   │
│ dog    ┆ 8.1    │
│ null   ┆ 7.5    │
│ bird   ┆ 0.5    │
│ cow    ┆ 460.0  │
│ bird   ┆ 0.4    │
│ bird   ┆ null   │
└────────┴────────┘
Enum(categories=['bird', 'dog', 'cow'])

ps.with_columns is a drop-in replacement for Polars' with_columns that also handles expressions which need to peek at the full data before they can be evaluated. It works efficiently on both DataFrame and LazyFrame.

Details

The key idea is FrameExpr — an expression that needs a peek at the data (schema or a small aggregation) before it resolves into a regular Polars expression. This unlocks operations like deriving Enum categories from the data, lumping rare levels, or reordering factor levels by a summary statistic, while keeping the rest of your pipeline lazy.

How FrameExpr stays efficient

ps.with_columns resolves each FrameExpr in two phases. First it runs a small aggregation (e.g. unique().sort() to discover categories) against the current lazy plan — so any preceding .filter() or .select() is already embedded and Polars’ predicate/projection pushdown keeps the peek cheap. Then it uses the result to build a concrete pl.Expr (e.g. .cast(pl.Enum(["a", "b", "c"]))) that goes back into the lazy plan and executes normally.

# Only the filtered rows are scanned for category discovery;
# the cast itself remains lazy.
lf = pl.scan_parquet("events.parquet")
result = (
    lf.filter(pl.col("country") == "DE")
      .ps.with_columns(pl.col("status").ps_enum.make())
      .filter(pl.col("status") == "active")
      .collect()
)

See the FrameExpr docstring for the full explanation, including when the peek is larger and notes on parallel evaluation.

Calling arbitrary functions

Sometimes there’s no Polars expression for what you need. ps.F, ps.B, and ps.E wrap arbitrary functions (numpy, scipy, plain Python, …) so they can be called directly on expressions, in place of hand-rolled pl.struct(...).map_batches(...) or pl.struct(...).map_elements(...) calls.

ps.F is the right default whenever a function needs to see the complete input, not a sample or a batch — clustering is the clearest example. scipy’s fclusterdata takes every point at once and assigns cluster labels; there’s no Polars equivalent, and critically, it cannot be computed correctly on a slice of the data. Only the pl.Expr argument (pl.concat_arr("x", "y")) is resolved against the data — t and criterion are forwarded to fclusterdata unchanged:

import numpy as np
import polars as pl
import polarstation as ps
from scipy.cluster.hierarchy import fclusterdata

df = pl.DataFrame({
    "region": ["A", "A", "A", "A", "B", "B", "B", "B"],
    "x": [0.0, 0.0, 0.0, 0.0, 100.0, 100.0, 100.0, 100.0],
    "y": [0.0, 0.2, 9.8, 10.0, 0.0, 0.2, 9.8, 10.0],
})

df.ps.with_columns(
    cluster=ps.F(fclusterdata)(pl.concat_arr("x", "y"), t=2, criterion="maxclust")
)

shape: (8, 4)
┌────────┬───────┬──────┬─────────┐
│ region ┆ x     ┆ y    ┆ cluster │
│ ---    ┆ ---   ┆ ---  ┆ ---     │
│ str    ┆ f64   ┆ f64  ┆ i32     │
╞════════╪═══════╪══════╪═════════╡
│ A      ┆ 0.0   ┆ 0.0  ┆ 1       │
│ A      ┆ 0.0   ┆ 0.2  ┆ 1       │
│ A      ┆ 0.0   ┆ 9.8  ┆ 1       │
│ A      ┆ 0.0   ┆ 10.0 ┆ 1       │
│ B      ┆ 100.0 ┆ 0.0  ┆ 2       │
│ B      ┆ 100.0 ┆ 0.2  ┆ 2       │
│ B      ┆ 100.0 ┆ 9.8  ┆ 2       │
│ B      ┆ 100.0 ┆ 10.0 ┆ 2       │
└────────┴───────┴──────┴─────────┘
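
To see why a slice is not enough, the sketch below calls fclusterdata directly (without polarstation) on the same eight points, once on all of them and once on the region A rows only. It relies on scipy's default single linkage with Euclidean distance, and compares label patterns rather than concrete label values.

```python
import numpy as np
from scipy.cluster.hierarchy import fclusterdata

# The same eight points as in the data frame above.
points = np.array([
    [0.0, 0.0], [0.0, 0.2], [0.0, 9.8], [0.0, 10.0],
    [100.0, 0.0], [100.0, 0.2], [100.0, 9.8], [100.0, 10.0],
])

# All points at once: the two clusters separate region A from region B.
full = fclusterdata(points, t=2, criterion="maxclust")

# Only the first four points: the same call now splits region A itself,
# because the large gap along x is no longer visible.
slice_a = fclusterdata(points[:4], t=2, criterion="maxclust")

print(full)
print(slice_a)
```

On the full data the four region A points share one label; on the slice they are split into two groups, so per-batch results could not simply be stitched together.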

ps.B is the right choice when fn genuinely doesn’t care about batching, e.g. np.logaddexp, the numerically stable way to compute log(exp(a) + exp(b)), for which there is no equivalent in Polars:

df2 = pl.DataFrame({"log_p": [-0.5, -3.0, -10.0], "log_q": [-1.2, -0.4, -9.5]})
df2.lazy().with_columns(
    combined=ps.B(np.logaddexp)(pl.col("log_p"), pl.col("log_q"))
).collect()

shape: (3, 3)
┌───────┬───────┬───────────┐
│ log_p ┆ log_q ┆ combined  │
│ ---   ┆ ---   ┆ ---       │
│ f64   ┆ f64   ┆ f64       │
╞═══════╪═══════╪═══════════╡
│ -0.5  ┆ -1.2  ┆ -0.096814 │
│ -3.0  ┆ -0.4  ┆ -0.328355 │
│ -10.0 ┆ -9.5  ┆ -9.025923 │
└───────┴───────┴───────────┘
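
The stability claim can be checked with numpy alone. The sketch below compares the naive formula with np.logaddexp; the value -1000 is an arbitrary choice, picked only because exp(-1000) underflows to zero in double precision.

```python
import numpy as np

a = np.array([-0.5, -1000.0])
b = np.array([-1.2, -1000.0])

# Naive formula: exp(-1000) underflows to 0.0, so the log becomes -inf.
with np.errstate(divide="ignore"):
    naive = np.log(np.exp(a) + np.exp(b))

# np.logaddexp avoids the underflow and returns -1000 + log(2).
stable = np.logaddexp(a, b)

print(naive)
print(stable)
```

For moderate inputs the two agree; for very negative log-probabilities only np.logaddexp returns a finite result.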

ps.E is for functions that only accept scalars, not arrays at all — like a hand-rolled edit distance, useful for catching typos against a reference list:

def levenshtein(a, b):
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i] + [0] * len(b)
        for j, cb in enumerate(b, 1):
            curr[j] = min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb))
        prev = curr
    return prev[-1]

df3 = pl.DataFrame({"typed": ["aplpe", "bananna", "orange"], "correct": ["apple", "banana", "orange"]})
df3.with_columns(dist=ps.E(levenshtein)(pl.col("typed"), pl.col("correct")))

shape: (3, 3)
┌─────────┬─────────┬──────┐
│ typed   ┆ correct ┆ dist │
│ ---     ┆ ---     ┆ ---  │
│ str     ┆ str     ┆ i64  │
╞═════════╪═════════╪══════╡
│ aplpe   ┆ apple   ┆ 2    │
│ bananna ┆ banana  ┆ 1    │
│ orange  ┆ orange  ┆ 0    │
└─────────┴─────────┴──────┘

Dev Notes

To re-render the README.md run:

quarto render README.qmd --to gfm

To build the documentation run:

uv run quarto render

and then in a separate terminal:

uv run quarto preview

To update the documentation at https://const-ae.github.io/polarstation/ run:

uv run quarto publish gh-pages

To upload to PyPI run:

uv build
uv publish

Acknowledgements

This package stands on the shoulders of several excellent projects:

The tidyverse team for establishing the tidy data philosophy and the vocabulary that shapes this package’s design.
Hadley Wickham and the forcats authors for the factor-manipulation functions that directly inspired the ps_enum namespace.
David Hugh-Jones for santoku, which inspired the ps_chop functions.
Allison Horst, Alison Hill, and Kristen Gorman for the palmerpenguins dataset used in the examples and walkthrough.

License

MIT