Frame helpers

Overview

Frame helpers

Methods on the .ps namespace, available on both DataFrame and LazyFrame. Expr.ps.apply is the one expression-level method in this group.

`df.ps.with_columns`	Like df.with_columns, but also accepts FrameExpr and multi-column selectors.
`df.ps.select`	Like df.select, but also accepts FrameExpr.
`Expr.ps.apply`	Apply a custom function with full LazyFrame context.

ps_enum — Enum column helpers

Methods on the .ps_enum expression namespace for working with categorical / Enum columns.

`Expr.ps_enum.make`	Cast a column to Enum, optionally deriving categories from the data.
`Expr.ps_enum.lump`	Collapse infrequent categories into `other_label`.
`Expr.ps_enum.rename`	Rename categories, leaving any not present in the mapping unchanged.
`Expr.ps_enum.rev`	Reverse the order of categories.
`Expr.ps_enum.infreq`	Reorder categories by frequency, most frequent first.
`Expr.ps_enum.reorder`	Reorder categories by an aggregation of one or more columns within each group.
`Expr.ps_enum.set_categories`	Set the exact category list. Values not in `categories` become null.
`Expr.ps_enum.unify`	Give all matched Enum columns the same category set — the union of all their levels.
`Expr.ps_enum.add_categories`	Insert new categories without changing any values.
`Expr.ps_enum.move`	Move specified categories to a given position, keeping all others in their relative order.
`Expr.ps_enum.drop_unused`	Remove categories that don’t appear in the data, preserving order.
`Expr.ps_enum.missing_to_category`	Convert null values into a new category `name`, appended at the end.
`Expr.ps_enum.category_to_missing`	Convert all occurrences of one or more categories to null and remove them from the Enum.

ps_chop — Binning helpers

Methods on the .ps_chop expression namespace for cutting a column into labelled intervals.

`Expr.ps_chop.chop`	Cut into intervals at explicit breakpoints.
`Expr.ps_chop.width`	Chop into equal-width bins of given size.
`Expr.ps_chop.n_elements`	Chop into groups of n observations each.
`Expr.ps_chop.n_groups`	Chop into k equal-count groups (by quantile boundaries).
`Expr.ps_chop.quantiles`	Chop at quantile boundaries.

ps_str — String column helpers

Methods on the .ps_str expression namespace.

`Expr.ps_str.format`	Format this column’s values into `template`, the way `str.format` formats a value.
`Expr.ps_str.count`	Count non-overlapping regex matches in each string.
`Expr.ps_str.wrap`	Wrap each string to at most `width` characters per line.
`Expr.ps_str.trunc`	Truncate each string to fit within `width` characters.

Function helpers

Top-level ps.* functions for calling arbitrary functions on expressions and formatting strings.

`ps.F`	Turn an arbitrary function into something callable on expressions, with real data.
`ps.B`	Turn an arbitrary vectorized function into something callable on expressions, lazily.
`ps.E`	Turn a scalar (non-vectorized) Python function into something callable on expressions.
`ps.format`	Format columns into a string, the way `str.format` formats values.
`ps.fmt_col`	Mark a column for embedding inside a real f-string, for a following `ps.format(...)`.

Internals

Building blocks for writing custom FrameExpr.

`FrameExpr`	An expression that requires a LazyFrame context to resolve into a list of pl.Expr.

df.ps.with_columns

df.ps.with_columns(
    *exprs,
    **named_exprs,
)

Like df.with_columns, but also accepts FrameExpr and multi-column selectors.

*exprs = ()

**named_exprs = {}

Examples:

animals = polarstation.make_example_data("animals")
animals.ps.with_columns(
    pl.col("animal").ps_enum.make().ps_enum.reorder(by="weight")
)

shape: (5, 2)
┌────────┬────────┐
│ animal ┆ weight │
│ ---    ┆ ---    │
│ enum   ┆ f64    │
╞════════╪════════╡
│ dog    ┆ 12.2   │
│ null   ┆ 7.5    │
│ bird   ┆ 0.5    │
│ cow    ┆ 460.0  │
│ bird   ┆ null   │
└────────┴────────┘

df.ps.select

df.ps.select(
    *exprs,
    **named_exprs,
)

Like df.select, but also accepts FrameExpr.

*exprs = ()

**named_exprs = {}

Examples:

animals = polarstation.make_example_data("animals")
animals.ps.select(pl.col("animal").ps_enum.make(), "weight")

shape: (5, 2)
┌────────┬────────┐
│ animal ┆ weight │
│ ---    ┆ ---    │
│ enum   ┆ f64    │
╞════════╪════════╡
│ dog    ┆ 12.2   │
│ null   ┆ 7.5    │
│ bird   ┆ 0.5    │
│ cow    ┆ 460.0  │
│ bird   ┆ null   │
└────────┴────────┘

Expr.ps.apply

Expr.ps.apply(
    fn,
)

Apply a custom function with full LazyFrame context.

fn Callable[[pl.LazyFrame, pl.Expr], pl.Expr]: Called as fn(lf, col_ref) → pl.Expr for each matched column. col_ref evaluates to the column’s values in lf; it works correctly for any expression shape, including transforms and when/then/otherwise. Use col_ref.meta.output_name() when the string column name is needed.

Examples:

def center_scale(lf: pl.LazyFrame, col_ref: pl.Expr) -> pl.Expr:
    stats = lf.select(
        col_ref.mean().alias("m"), col_ref.std().alias("s")
    ).collect()
    m, s = stats["m"][0], stats["s"][0]
    return (col_ref - m) / s

pl.DataFrame({"x": [1.0, 2.0, 3.0, 4.0, 5.0]}).ps.with_columns(
    pl.col("x").ps.apply(center_scale)
)

shape: (5, 1)
┌───────────┐
│ x         │
│ ---       │
│ f64       │
╞═══════════╡
│ -1.264911 │
│ -0.632456 │
│ 0.0       │
│ 0.632456  │
│ 1.264911  │
└───────────┘

import math

df = pl.DataFrame({
    "doc_id": [1, 1, 2, 2, 2],
    "term": ["cat", "dog", "cat", "cat", "bird"],
})

def idf(lf: pl.LazyFrame, col_ref: pl.Expr) -> pl.Expr:
    col_name = col_ref.meta.output_name()
    n = lf.select(pl.col("doc_id").n_unique()).collect().item()
    freq = lf.group_by(col_ref).agg(
        pl.col("doc_id").n_unique().alias("n")
    ).collect()
    scores = {r[col_name]: math.log(n / r["n"]) for r in freq.iter_rows(named=True)}
    return col_ref.replace_strict(
        list(scores), list(scores.values()), return_dtype=pl.Float64
    )

df.ps.with_columns(pl.col("term").ps.apply(idf).alias("idf"))

shape: (5, 3)
┌────────┬──────┬──────────┐
│ doc_id ┆ term ┆ idf      │
│ ---    ┆ ---  ┆ ---      │
│ i64    ┆ str  ┆ f64      │
╞════════╪══════╪══════════╡
│ 1      ┆ cat  ┆ 0.0      │
│ 1      ┆ dog  ┆ 0.693147 │
│ 2      ┆ cat  ┆ 0.0      │
│ 2      ┆ cat  ┆ 0.0      │
│ 2      ┆ bird ┆ 0.693147 │
└────────┴──────┴──────────┘

ps_enum — Enum column helpers

Expr.ps_enum.make

Expr.ps_enum.make(
    categories=None,
    make_null=(),
)

Cast a column to Enum, optionally deriving categories from the data.

When categories are derived from the data, they are sorted by the column’s native dtype before being cast to string. This means integers sort numerically (1, 2, 10), dates chronologically, and strings alphabetically — rather than all sorting lexicographically.

categories Sequence[str] | None: Fixed set of allowed values. If omitted, derived from the data as the unique values sorted by native dtype order.
make_null Sequence[str] | str = (): Values to replace with null before casting.

Examples:

animals = polarstation.make_example_data("animals")
animals.ps.with_columns(pl.col("animal").ps_enum.make())

shape: (5, 2)
┌────────┬────────┐
│ animal ┆ weight │
│ ---    ┆ ---    │
│ enum   ┆ f64    │
╞════════╪════════╡
│ dog    ┆ 12.2   │
│ null   ┆ 7.5    │
│ bird   ┆ 0.5    │
│ cow    ┆ 460.0  │
│ bird   ┆ null   │
└────────┴────────┘

pl.DataFrame({"x": ["a", "b", "?"]}).ps.with_columns(
    pl.col("x").ps_enum.make(categories=["a", "b", "z"], make_null="?")
)['x'].dtype

Enum(categories=['a', 'b', 'z'])
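
The effect of sorting by native dtype before casting to string can be illustrated in plain Python. This is only a sketch of the ordering rule described above, not polarstation code, and the variable names are made up for the illustration.

```python
# Plain-Python illustration of "sort by native dtype, then cast to string".
# Not polarstation code; it only mimics the ordering rule described above.
values = [10, 2, 1, 2]

# Native order first, then cast: numeric order is preserved.
native_then_cast = [str(v) for v in sorted(set(values))]

# Cast first, then sort: the strings sort lexicographically.
cast_then_sort = sorted({str(v) for v in values})

print(native_then_cast)  # ['1', '2', '10']
print(cast_then_sort)    # ['1', '10', '2']
```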

Expr.ps_enum.lump

Expr.ps_enum.lump(
    n=5,
    other_label='Other',
    lump_fn=None,
)

Collapse infrequent categories into other_label.

By default keeps the top-n most frequent categories and collapses the rest. Pass lump_fn to use a custom rule instead (in which case n is ignored).

The order of the categories remains unchanged with other_label appended at the end.

The function also accepts a String or Categorical column, in addition to Enum, in which case ps_enum.make() is called first.

n int = 5: Number of categories to keep (ignored when lump_fn is provided).
other_label str = 'Other': Label for the collapsed category.
lump_fn Callable[[pl.DataFrame], Iterable[bool]] | None: Optional callable that receives the non-null counts DataFrame (columns: category column + "n", sorted by frequency descending) and returns a boolean sequence where True marks categories to collapse.

Examples:

animals = polarstation.make_example_data("animals")
animals.ps.with_columns(
    pl.col("animal").ps_enum.make().ps_enum.lump(n=1)
)

shape: (5, 2)
┌────────┬────────┐
│ animal ┆ weight │
│ ---    ┆ ---    │
│ enum   ┆ f64    │
╞════════╪════════╡
│ Other  ┆ 12.2   │
│ null   ┆ 7.5    │
│ bird   ┆ 0.5    │
│ Other  ┆ 460.0  │
│ bird   ┆ null   │
└────────┴────────┘
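
The default top-n rule can be sketched in plain Python as follows. The helper name `lump_top_n` is hypothetical, and the tie-breaking between equally frequent categories (earlier category wins) is an assumption, since the text above does not specify it.

```python
from collections import Counter

def lump_top_n(values, categories, n=5, other_label="Other"):
    """Keep the n most frequent categories; map the rest to other_label.

    Plain-Python sketch of the default rule, not the library implementation.
    Assumption: ties are broken in favour of the earlier category.
    """
    counts = Counter(v for v in values if v is not None)
    # sorted() is stable, so equally frequent categories keep their order.
    ranked = sorted(categories, key=lambda c: -counts[c])
    keep = set(ranked[:n])
    new_values = [v if v is None or v in keep else other_label for v in values]
    # Kept categories stay in their original order; other_label goes last.
    new_categories = [c for c in categories if c in keep]
    if len(keep) < len(categories):
        new_categories.append(other_label)
    return new_values, new_categories

animal = ["dog", None, "bird", "cow", "bird"]
values, cats = lump_top_n(animal, ["bird", "cow", "dog"], n=1)
print(values)  # ['Other', None, 'bird', 'Other', 'bird']
print(cats)    # ['bird', 'Other']
```

Nulls are left untouched, which matches the example output above.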

Expr.ps_enum.rename

Expr.ps_enum.rename(
    mapping,
    strict=True,
)

Rename categories, leaving any not present in the mapping unchanged.

The function also accepts a String or Categorical column, in addition to Enum, in which case ps_enum.make() is called first.

mapping Mapping[str, str] | Callable[[str], str]: A dict of old → new names, or a callable applied to each category name.
strict bool = True: If True (default), raise if any dict key is not an existing category.

Examples:

animals = polarstation.make_example_data("animals")
animals.ps.with_columns(
    pl.col("animal").ps_enum.make().ps_enum.rename({"bird": "Bird", "cow": "Cow"})
)

shape: (5, 2)
┌────────┬────────┐
│ animal ┆ weight │
│ ---    ┆ ---    │
│ enum   ┆ f64    │
╞════════╪════════╡
│ dog    ┆ 12.2   │
│ null   ┆ 7.5    │
│ Bird   ┆ 0.5    │
│ Cow    ┆ 460.0  │
│ Bird   ┆ null   │
└────────┴────────┘

animals.ps.with_columns(
    pl.col("animal").ps_enum.make().ps_enum.rename(str.upper)
)

shape: (5, 2)
┌────────┬────────┐
│ animal ┆ weight │
│ ---    ┆ ---    │
│ enum   ┆ f64    │
╞════════╪════════╡
│ DOG    ┆ 12.2   │
│ null   ┆ 7.5    │
│ BIRD   ┆ 0.5    │
│ COW    ┆ 460.0  │
│ BIRD   ┆ null   │
└────────┴────────┘
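
The difference between a dict mapping, a callable mapping, and the strict flag can be sketched in plain Python. The helper `rename_categories` is hypothetical, and the exception type is an assumption, since the parameter description only says that strict mode raises.

```python
def rename_categories(categories, mapping, strict=True):
    """Plain-Python sketch: rename categories via a dict or a callable."""
    if callable(mapping):
        # A callable is applied to every category name.
        return [mapping(c) for c in categories]
    unknown = [k for k in mapping if k not in categories]
    if strict and unknown:
        # Assumption: ValueError; the docs only say "raise".
        raise ValueError(f"not existing categories: {unknown}")
    # Categories missing from the mapping are left unchanged.
    return [mapping.get(c, c) for c in categories]

cats = ["bird", "cow", "dog"]
print(rename_categories(cats, {"bird": "Bird", "cow": "Cow"}))  # ['Bird', 'Cow', 'dog']
print(rename_categories(cats, str.upper))                       # ['BIRD', 'COW', 'DOG']
print(rename_categories(cats, {"emu": "Emu"}, strict=False))    # ['bird', 'cow', 'dog']
```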

Expr.ps_enum.rev

Expr.ps_enum.rev()

Reverse the order of categories.

Examples:

animals = polarstation.make_example_data("animals")
animals.ps.with_columns(
    pl.col("animal").ps_enum.make().ps_enum.rev()
)["animal"].dtype

Enum(categories=['dog', 'cow', 'bird'])

Expr.ps_enum.infreq

Expr.ps_enum.infreq(
    descending=False,
)

Reorder categories by frequency, most frequent first.

The function also accepts a String or Categorical column, in addition to Enum, in which case ps_enum.make() is called first.

descending bool = False: If True, least frequent first instead.

Examples:

animals = polarstation.make_example_data("animals")
animals.ps.with_columns(
    pl.col("animal").ps_enum.make().ps_enum.infreq()
)["animal"].dtype

Enum(categories=['bird', 'cow', 'dog'])

Expr.ps_enum.reorder

Expr.ps_enum.reorder(
    by,
    agg=pl.Expr.median,
    descending=False,
    nulls_last=False,
    missing='drop',
)

Reorder categories by an aggregation of one or more columns within each group.

The function also accepts a String or Categorical column, in addition to Enum, in which case ps_enum.make() is called first.

by IntoExpr | Iterable[IntoExpr]: Column(s) to aggregate per category for ordering. Strings are treated as column names.
agg Callable[[pl.Expr], pl.Expr] = pl.Expr.median: Aggregation applied to each by column (default: median).
descending bool | Sequence[bool] = False: Sort descending. A single bool applies to all columns.
nulls_last bool | Sequence[bool] = False: Place null aggregates last. A single bool applies to all columns.
missing Literal['drop', 'last', 'first'] = 'drop': How to handle categories whose aggregate is null — ‘drop’ excludes them, ‘last’ appends them, ‘first’ prepends them.

Examples:

animals = polarstation.make_example_data("animals")
animals.ps.with_columns(
    pl.col("animal").ps_enum.make().ps_enum.reorder("weight", agg=pl.Expr.mean)
)["animal"].dtype

Enum(categories=['bird', 'dog', 'cow'])

With .over(), each group’s order is computed from its own data, and the per-group categories are then unioned into the declared category list of a single shared Enum. A value’s level (obtained by chaining .ps_enum.to_level()) is therefore always correct within its own group, but the Enum dtype’s declared category order is one overall order and need not match any single group’s order.
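
The ungrouped case (no .over()) can be sketched in plain Python: aggregate the `by` values per category, then sort categories by that aggregate. The helper `reorder_by` is hypothetical and handles a single `by` column in ascending order only; skipping null `by` values before aggregating is an assumption.

```python
from statistics import median

def reorder_by(cat_values, by_values, categories, agg=median, missing="drop"):
    """Plain-Python sketch of ordering categories by a per-category aggregate."""
    grouped = {c: [] for c in categories}
    for c, w in zip(cat_values, by_values):
        # Assumption: rows with a null category or null `by` value are skipped.
        if c is not None and w is not None:
            grouped[c].append(w)
    scored = {c: agg(ws) for c, ws in grouped.items() if ws}
    # Categories with no usable values have a null aggregate.
    empty = [c for c in categories if c not in scored]
    ordered = sorted(scored, key=scored.get)
    if missing == "first":
        return empty + ordered
    if missing == "last":
        return ordered + empty
    return ordered  # missing == "drop"

animal = ["dog", None, "bird", "cow", "bird"]
weight = [12.2, 7.5, 0.5, 460.0, None]
print(reorder_by(animal, weight, ["bird", "cow", "dog"]))  # ['bird', 'dog', 'cow']
print(reorder_by(animal, weight, ["bird", "cow", "dog", "emu"], missing="last"))
# ['bird', 'dog', 'cow', 'emu']
```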

Expr.ps_enum.set_categories

Expr.ps_enum.set_categories(
    categories,
)

Set the exact category list. Values not in categories become null.

categories Sequence[str]: The new ordered category list.

Examples:

animals = polarstation.make_example_data("animals")
animals.ps.with_columns(
    pl.col("animal").ps_enum.make().ps_enum.set_categories(["cow", "dog"])
)

shape: (5, 2)
┌────────┬────────┐
│ animal ┆ weight │
│ ---    ┆ ---    │
│ enum   ┆ f64    │
╞════════╪════════╡
│ dog    ┆ 12.2   │
│ null   ┆ 7.5    │
│ null   ┆ 0.5    │
│ cow    ┆ 460.0  │
│ null   ┆ null   │
└────────┴────────┘

Expr.ps_enum.unify

Expr.ps_enum.unify()

Give all matched Enum columns the same category set — the union of all their levels.

Categories are ordered by first appearance across columns (left to right). Values are never changed; only the dtype gains the extra categories.

Requires all matched columns to already be Enum. Call .ps_enum.make() first if needed.

Examples:

df = pl.DataFrame({
    'x': pl.Series(['a', 'b'], dtype=pl.Enum(['a', 'b'])),
    'y': pl.Series(['b', 'c'], dtype=pl.Enum(['b', 'c'])),
})
df.ps.with_columns(pl.col('x', 'y').ps_enum.unify())['x'].dtype

Enum(categories=['a', 'b', 'c'])
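
The "union ordered by first appearance" rule can be sketched in plain Python. The helper name `unified_categories` is hypothetical; it only illustrates the ordering and is not the library implementation.

```python
def unified_categories(*category_lists):
    """Union of category lists, ordered by first appearance (left to right)."""
    seen = {}  # dicts preserve insertion order
    for cats in category_lists:
        for c in cats:
            seen.setdefault(c, None)
    return list(seen)

print(unified_categories(["a", "b"], ["b", "c"]))  # ['a', 'b', 'c']
print(unified_categories(["b", "c"], ["a", "b"]))  # ['b', 'c', 'a']
```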

Expr.ps_enum.add_categories

Expr.ps_enum.add_categories(
    categories,
    before=None,
)

Insert new categories without changing any values.

categories Sequence[str]: New category labels to add.
before int | None: Insert before this 0-based index of the existing categories. None (default) appends at the end. Negative indices count from the end. Any value ≥ the number of existing categories is equivalent to None (end).

Examples:

animals = polarstation.make_example_data("animals")
animals.ps.with_columns(
    pl.col("animal").ps_enum.make().ps_enum.add_categories(["rabbit"], before=1)
)["animal"].dtype

Enum(categories=['bird', 'rabbit', 'cow', 'dog'])

Expr.ps_enum.move

Expr.ps_enum.move(
    *levels,
    before=0,
)

Move specified categories to a given position, keeping all others in their relative order.

*levels str = ()

before int | None = 0: Insert before this 0-based index of the remaining categories. 0 (default) moves to the front. None appends at the end. Negative indices count from the end of the remaining categories. Any value ≥ len(remaining) is equivalent to None (end).

Examples:

animals = polarstation.make_example_data("animals")
animals.ps.with_columns(
    pl.col("animal").ps_enum.make().ps_enum.move("dog")
)["animal"].dtype

Enum(categories=['dog', 'bird', 'cow'])
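
The `before` index is applied to the remaining categories, that is, the list with the moved levels already taken out. A plain-Python sketch follows; the helper `move_levels` is hypothetical, and it assumes that negative and out-of-range indices behave like Python list slicing.

```python
def move_levels(categories, *levels, before=0):
    """Plain-Python sketch of moving levels to a position among the rest."""
    remaining = [c for c in categories if c not in levels]
    if before is None:
        before = len(remaining)
    # Slicing treats negative indices as counting from the end and clamps
    # indices past the end. Assumption: the library behaves the same way.
    return remaining[:before] + list(levels) + remaining[before:]

cats = ["bird", "cow", "dog"]
print(move_levels(cats, "dog"))               # ['dog', 'bird', 'cow']
print(move_levels(cats, "bird", before=None)) # ['cow', 'dog', 'bird']
print(move_levels(cats, "dog", before=-1))    # ['bird', 'dog', 'cow']
```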

Expr.ps_enum.drop_unused

Expr.ps_enum.drop_unused()

Remove categories that don’t appear in the data, preserving order.

Examples:

df = pl.DataFrame(
    {'x': pl.Series('x', ['bird', 'bird'], dtype=pl.Enum(['fish', 'bird', 'cat']))}
)
df.ps.with_columns(pl.col("x").ps_enum.drop_unused())["x"].dtype

Enum(categories=['bird'])

Expr.ps_enum.missing_to_category

Expr.ps_enum.missing_to_category(
    name,
)

Convert null values into a new category name, appended at the end.

If name is already a category, null values are mapped to the existing category without modifying the category list.

name str: Label for the category to assign to null values.

Examples:

animals = polarstation.make_example_data("animals")
animals.ps.with_columns(
    new_animals = pl.col("animal").ps_enum.make().ps_enum.missing_to_category("unknown")
)

shape: (5, 3)
┌────────┬────────┬─────────────┐
│ animal ┆ weight ┆ new_animals │
│ ---    ┆ ---    ┆ ---         │
│ str    ┆ f64    ┆ enum        │
╞════════╪════════╪═════════════╡
│ dog    ┆ 12.2   ┆ dog         │
│ null   ┆ 7.5    ┆ unknown     │
│ bird   ┆ 0.5    ┆ bird        │
│ cow    ┆ 460.0  ┆ cow         │
│ bird   ┆ null   ┆ bird        │
└────────┴────────┴─────────────┘

Expr.ps_enum.category_to_missing

Expr.ps_enum.category_to_missing(
    name,
)

Convert all occurrences of one or more categories to null and remove them from the Enum.

name str | Sequence[str]: Category name(s) to nullify. Raises if any are not current categories.

Examples:

animals = polarstation.make_example_data("animals")
animals.ps.with_columns(
    new_animals = pl.col("animal").ps_enum.make().ps_enum.category_to_missing("bird")
)

shape: (5, 3)
┌────────┬────────┬─────────────┐
│ animal ┆ weight ┆ new_animals │
│ ---    ┆ ---    ┆ ---         │
│ str    ┆ f64    ┆ enum        │
╞════════╪════════╪═════════════╡
│ dog    ┆ 12.2   ┆ dog         │
│ null   ┆ 7.5    ┆ null        │
│ bird   ┆ 0.5    ┆ null        │
│ cow    ┆ 460.0  ┆ cow         │
│ bird   ┆ null   ┆ null        │
└────────┴────────┴─────────────┘

ps_chop — Binning helpers

Expr.ps_chop.chop

Expr.ps_chop.chop(
    breaks,
    labels=None,
    left_closed=True,
    fmt=None,
    extend=True,
    return_struct=False,
)

Cut into intervals at explicit breakpoints.

Returns an Enum-typed column whose category names are the bin labels. Integer columns use fully-closed [a, b] notation; single-element bins are written as {x}.

breaks Sequence[Any]: Interior breakpoints; sorted automatically. Accepts numeric, string, or temporal Python values (datetime, date, timedelta, time).
labels Sequence[str] | None: Category labels (must have length len(breaks) + 1). Auto-generated if omitted.
left_closed bool = True: If True (default), intervals are [lo, hi); otherwise (lo, hi].
fmt str | Callable | None: Formatter for auto-generated labels. For numeric, a format-spec string (e.g. “.2f”) or callable. For temporal, a callable or None (uses str()).
extend bool = True: For numeric only — if True (default), outermost labels extend to -∞/+∞. For unsigned integers: 0/+∞. If False, uses data min/max. Temporal breaks always use data bounds regardless of this setting.
return_struct bool = False: If True, return a struct {lo, hi} instead of just the label.

Examples:

scores = polarstation.make_example_data("scores")
scores.ps.with_columns(
    pl.col("score").ps_chop.chop([40, 70], fmt=".0f").alias("grade")
)

shape: (7, 2)
┌───────┬──────────┐
│ score ┆ grade    │
│ ---   ┆ ---      │
│ i64   ┆ enum     │
╞═══════╪══════════╡
│ 12    ┆ (-∞, 39] │
│ 45    ┆ [40, 69] │
│ 67    ┆ [40, 69] │
│ 89    ┆ [70, +∞) │
│ 95    ┆ [70, +∞) │
│ 23    ┆ (-∞, 39] │
│ 78    ┆ [70, +∞) │
└───────┴──────────┘
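
For integer data, the labelling shown above can be reproduced with a short plain-Python sketch. The helper `chop_int` is hypothetical and covers only integer values with at least one break, left_closed=True and extend=True; it is not the library implementation.

```python
from bisect import bisect_right

def chop_int(values, breaks):
    """Sketch: integer chop with left-closed bins and open-ended outer labels."""
    breaks = sorted(breaks)
    edges = [None] + breaks + [None]  # None marks an open end
    labels = []
    for lo, hi in zip(edges, edges[1:]):
        if lo is None:
            labels.append(f"(-∞, {hi - 1}]")
        elif hi is None:
            labels.append(f"[{lo}, +∞)")
        elif hi - 1 == lo:
            labels.append(f"{{{lo}}}")  # single-element bin, written {x}
        else:
            labels.append(f"[{lo}, {hi - 1}]")  # [lo, hi) as closed integers
    # bisect_right puts a value equal to a break into the bin to its right.
    return [labels[bisect_right(breaks, v)] for v in values], labels

scores = [12, 45, 67, 89, 95, 23, 78]
grades, labels = chop_int(scores, [40, 70])
print(labels)      # ['(-∞, 39]', '[40, 69]', '[70, +∞)']
print(grades[:3])  # ['(-∞, 39]', '[40, 69]', '[40, 69]']
```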

Expr.ps_chop.width

Expr.ps_chop.width(
    size,
    start=None,
    labels=None,
    left_closed=True,
    fmt=None,
    extend=False,
    return_struct=False,
)

Chop into equal-width bins of given size.

Returns an Enum-typed column whose category names are the bin labels.

size float | _dt.timedelta: Width of each bin. For numeric columns, a number. For temporal columns, a datetime.timedelta.
start Any | None: Left edge of the first bin. Defaults to the column minimum (or 0 for unsigned integer columns).
labels Sequence[str] | None: Category labels. Auto-generated if omitted.
left_closed bool = True: If True (default), intervals are [lo, hi); otherwise (lo, hi].
fmt str | Callable | None: Formatter for auto-generated labels. For numeric, a format-spec string or callable; defaults to “g”. For temporal, a callable or None (uses str()).
extend bool = False: If True, extend outermost labels to -∞ / +∞. If False (default), the first label opens at the anchor and the last closes at anchor + n_bins * size.
return_struct bool = False: If True, return a struct instead of just the label.

Examples:

scores = polarstation.make_example_data("scores")
scores.ps.with_columns(pl.col("score").ps_chop.width(25).alias("band"))

shape: (7, 2)
┌───────┬──────────┐
│ score ┆ band     │
│ ---   ┆ ---      │
│ i64   ┆ enum     │
╞═══════╪══════════╡
│ 12    ┆ [12, 36] │
│ 45    ┆ [37, 61] │
│ 67    ┆ [62, 86] │
│ 89    ┆ [87, 95] │
│ 95    ┆ [87, 95] │
│ 23    ┆ [12, 36] │
│ 78    ┆ [62, 86] │
└───────┴──────────┘
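
The bin assignment behind equal-width chopping can be sketched in plain Python. The helper `width_bin_edges` is hypothetical and returns half-open numeric edges [lo, hi) rather than formatted labels; it assumes left_closed=True and a numeric column.

```python
import math

def width_bin_edges(values, size, start=None):
    """Sketch: map each value to the [lo, hi) edges of its equal-width bin."""
    if start is None:
        start = min(values)  # default anchor: the column minimum
    out = []
    for v in values:
        i = math.floor((v - start) / size)  # 0-based bin index
        out.append((start + i * size, start + (i + 1) * size))
    return out

scores = [12, 45, 67, 89, 95, 23, 78]
edges = width_bin_edges(scores, 25)
print(edges[0])  # (12, 37)
print(edges[3])  # (87, 112)
```

The grouping matches the example above: 12 and 23 share a bin, as do 67 and 78, and 89 and 95.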

Expr.ps_chop.n_elements

Expr.ps_chop.n_elements(
    n,
    tail='split',
    labels=None,
    left_closed=True,
    fmt='g',
    extend=False,
    return_struct=False,
)

Chop into groups of n observations each.

Returns an Enum-typed column whose category names are the bin labels. Boundaries are drawn after every nth element (sorted order). Ties are never split — the boundary advances to the next distinct value if needed.

n int: Number of observations per group.
tail Literal['split', 'merge'] = 'split': What to do when the total doesn’t divide evenly. “split” (default) keeps the smaller final group; “merge” absorbs it into the preceding group.
labels Sequence[str] | None: Category labels. Auto-generated if omitted.
left_closed bool = True: If True (default), intervals are [lo, hi); otherwise (lo, hi].
fmt str | Callable[[float], str] = 'g': Number formatter for auto-generated labels (numeric columns only).
extend bool = False: If True, extend outermost labels to -∞ / +∞ (or 0 / +∞ for unsigned integers). If False (default), the first label opens at the data minimum and the last closes at the data maximum.
return_struct bool = False: If True, return a struct instead of just the label.

Examples:

scores = polarstation.make_example_data("scores")
scores.ps.with_columns(pl.col("score").ps_chop.n_elements(3).alias("tercile"))

shape: (7, 2)
┌───────┬──────────┐
│ score ┆ tercile  │
│ ---   ┆ ---      │
│ i64   ┆ enum     │
╞═══════╪══════════╡
│ 12    ┆ [12, 66] │
│ 45    ┆ [12, 66] │
│ 67    ┆ [67, 94] │
│ 89    ┆ [67, 94] │
│ 95    ┆ {95}     │
│ 23    ┆ [12, 66] │
│ 78    ┆ [67, 94] │
└───────┴──────────┘
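
The boundary rule (a cut after every nth sorted element, never splitting ties) and the two `tail` modes can be sketched in plain Python. The helper `chunk_sorted` is hypothetical and returns the grouped values rather than labels.

```python
def chunk_sorted(values, n, tail="split"):
    """Sketch: group sorted values n at a time without splitting ties."""
    ordered = sorted(values)
    groups = []
    i = 0
    while i < len(ordered):
        j = min(i + n, len(ordered))
        # Advance the boundary to the next distinct value if it would split ties.
        while j < len(ordered) and ordered[j] == ordered[j - 1]:
            j += 1
        groups.append(ordered[i:j])
        i = j
    if tail == "merge" and len(groups) > 1 and len(groups[-1]) < n:
        last = groups.pop()
        groups[-1].extend(last)  # absorb the short final group
    return groups

scores = [12, 45, 67, 89, 95, 23, 78]
print(chunk_sorted(scores, 3))                # [[12, 23, 45], [67, 78, 89], [95]]
print(chunk_sorted(scores, 3, tail="merge"))  # [[12, 23, 45], [67, 78, 89, 95]]
print(chunk_sorted([1, 1, 1, 2, 3], 2))       # [[1, 1, 1], [2, 3]]
```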

Expr.ps_chop.n_groups

Expr.ps_chop.n_groups(
    k,
    labels=None,
    left_closed=True,
    fmt=None,
    raw=True,
    extend=False,
    return_struct=False,
)

Chop into k equal-count groups (by quantile boundaries).

Returns an Enum-typed column whose category names are the bin labels.

k int: Number of groups.
labels Sequence[str] | None: Category labels (must have length k). Auto-generated if omitted.
left_closed bool = True: If True (default), intervals are [lo, hi); otherwise (lo, hi].
fmt str | Callable | None: Formatter for auto-generated labels. For numeric, defaults to “g” when raw=True and “.0%” when raw=False. For temporal, a callable or None.
raw bool = True: If True (default), label with the actual break values. If False, use percentage labels (e.g. [0%, 25%)). Ignored for temporal columns.
extend bool = False: If True, extend outermost labels to -∞ / +∞ (only affects numeric raw=True). Default False. For unsigned columns, lower bound is 0.
return_struct bool = False: If True, return a struct instead of just the label.

Examples:

scores = polarstation.make_example_data("scores")
scores.ps.with_columns(pl.col("score").ps_chop.n_groups(3).alias("tertile"))

shape: (7, 2)
┌───────┬──────────┐
│ score ┆ tertile  │
│ ---   ┆ ---      │
│ i64   ┆ enum     │
╞═══════╪══════════╡
│ 12    ┆ [12, 44] │
│ 45    ┆ [45, 77] │
│ 67    ┆ [45, 77] │
│ 89    ┆ [78, 95] │
│ 95    ┆ [78, 95] │
│ 23    ┆ [12, 44] │
│ 78    ┆ [78, 95] │
└───────┴──────────┘
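
The break values in the example above can be reproduced with the standard library. The interpolation method polarstation uses is not stated here; the "inclusive" linear interpolation below is an assumption that happens to give the same breaks (45 and 78) for this data.

```python
from bisect import bisect_right
from statistics import quantiles

scores = [12, 45, 67, 89, 95, 23, 78]

# Two interior cut points split the data into k = 3 groups.
# Assumption: linear interpolation over the inclusive range of the data.
breaks = quantiles(scores, n=3, method="inclusive")

# Left-closed bins: a value equal to a break goes into the bin to its right.
groups = [bisect_right(breaks, s) for s in scores]

print(breaks)  # [45.0, 78.0]
print(groups)  # [0, 1, 1, 2, 2, 0, 2]
```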

Expr.ps_chop.quantiles

Expr.ps_chop.quantiles(
    probs,
    labels=None,
    left_closed=True,
    fmt=None,
    raw=False,
    extend=False,
    return_struct=False,
)

Chop at quantile boundaries.

Returns an Enum-typed column whose category names are the bin labels.

probs Sequence[float]: Quantile probabilities in (0, 1), e.g. [0.25, 0.5, 0.75] for quartiles.
labels Sequence[str] | None: Category labels (must have length len(probs) + 1). Auto-generated if omitted.
left_closed bool = True: If True (default), intervals are [lo, hi); otherwise (lo, hi].
fmt str | Callable | None: Formatter for auto-generated labels. For numeric, defaults to “.0%” (percentages) when raw=False and “g” when raw=True. For temporal, a callable or None (uses str()).
raw bool = False: If True, label with the actual break values instead of percentages. Ignored for temporal columns (always uses actual values).
extend bool = False: If True, extend outermost labels to -∞ / +∞ (only affects numeric raw=True). Default False. For unsigned columns, lower bound is 0.
return_struct bool = False: If True, return a struct instead of just the label.

Examples:

scores = polarstation.make_example_data("scores")
scores.ps.with_columns(
    pl.col("score").ps_chop.quantiles([0.25, 0.75]).alias("iqr_group")
)

shape: (7, 2)
┌───────┬─────────────┐
│ score ┆ iqr_group   │
│ ---   ┆ ---         │
│ i64   ┆ enum        │
╞═══════╪═════════════╡
│ 12    ┆ [0%, 25%)   │
│ 45    ┆ [25%, 75%)  │
│ 67    ┆ [25%, 75%)  │
│ 89    ┆ [75%, 100%] │
│ 95    ┆ [75%, 100%] │
│ 23    ┆ [0%, 25%)   │
│ 78    ┆ [25%, 75%)  │
└───────┴─────────────┘
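
The percentage labels used when raw=False can be generated from `probs` alone. The helper `percent_labels` below is a hypothetical plain-Python sketch that assumes left_closed=True and the ".0%" format spec; it reproduces the three labels in the example above.

```python
def percent_labels(probs, fmt=".0%"):
    """Sketch: build left-closed percentage labels from quantile probabilities."""
    edges = [0.0] + sorted(probs) + [1.0]
    pairs = list(zip(edges, edges[1:]))
    labels = []
    for i, (lo, hi) in enumerate(pairs):
        # Only the last interval is closed on the right, so 100% is included.
        closer = "]" if i == len(pairs) - 1 else ")"
        labels.append(f"[{format(lo, fmt)}, {format(hi, fmt)}{closer}")
    return labels

print(percent_labels([0.25, 0.75]))  # ['[0%, 25%)', '[25%, 75%)', '[75%, 100%]']
```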

ps_str — String column helpers

Expr.ps_str.format

Expr.ps_str.format(
    template,
)

Format this column’s values into template, the way str.format formats a value.

For a plain (non-Struct) column, this is a one-argument shorthand for ps.format(template, self) — template contains exactly one {...} field, referring to this expression’s own values. For a Struct column, each field is instead unpacked as a named argument, keyed by its field name — template can then reference each one by name, the same way ps.format(template, **fields) would.

template str

Examples:

pl.DataFrame({"err": [0.5, 1.25, 12.0]}).ps.with_columns(
    msg=pl.col("err").ps_str.format("error={:.2f}")
)

shape: (3, 2)
┌──────┬─────────────┐
│ err  ┆ msg         │
│ ---  ┆ ---         │
│ f64  ┆ str         │
╞══════╪═════════════╡
│ 0.5  ┆ error=0.50  │
│ 1.25 ┆ error=1.25  │
│ 12.0 ┆ error=12.00 │
└──────┴─────────────┘

pl.DataFrame({"x": [1, 2], "y": [3.0, 4.0]}).ps.with_columns(
    msg=pl.struct(a="x", b="y").ps_str.format("a={a}, b={b:.1f}")
)

shape: (2, 3)
┌─────┬─────┬────────────┐
│ x   ┆ y   ┆ msg        │
│ --- ┆ --- ┆ ---        │
│ i64 ┆ f64 ┆ str        │
╞═════╪═════╪════════════╡
│ 1   ┆ 3.0 ┆ a=1, b=3.0 │
│ 2   ┆ 4.0 ┆ a=2, b=4.0 │
└─────┴─────┴────────────┘
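
Both modes correspond to ordinary str.format calls applied row by row. The plain-Python sketch below shows that correspondence, with list entries standing in for column values and dicts standing in for Struct rows; it is an illustration, not the library implementation.

```python
from string import Formatter

# Plain column: one positional field, formatted once per value.
err = [0.5, 1.25, 12.0]
msgs = ["error={:.2f}".format(v) for v in err]

# Struct column: each field name becomes a named argument.
struct_rows = [{"a": 1, "b": 3.0}, {"a": 2, "b": 4.0}]
template = "a={a}, b={b:.1f}"
struct_msgs = [template.format(**row) for row in struct_rows]

# The field names a template refers to can be listed with string.Formatter.
fields = [name for _, name, _, _ in Formatter().parse(template) if name is not None]

print(msgs)         # ['error=0.50', 'error=1.25', 'error=12.00']
print(struct_msgs)  # ['a=1, b=3.0', 'a=2, b=4.0']
print(fields)       # ['a', 'b']
```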

Expr.ps_str.count

Expr.ps_str.count(
    pattern='',
)

Count non-overlapping regex matches in each string.

Deprecated: thin wrapper around pl.Expr.str.count_matches; likely to be removed.

pattern = ''

Examples:

pl.DataFrame({"x": ["hello world", "foo bar baz", ""]}).select(
    pl.col("x").ps_str.count(r"\b\w+\b").alias("word_count")
)

shape: (3, 1)
┌────────────┐
│ word_count │
│ ---        │
│ u32        │
╞════════════╡
│ 2          │
│ 3          │
│ 0          │
└────────────┘

Expr.ps_str.wrap

Expr.ps_str.wrap(
    width=80,
    initial_indent=0,
    subsequent_indent=0,
    break_on_hyphens=True,
    **kwargs,
)

Wrap each string to at most width characters per line.

width int = 80: Maximum line length.
initial_indent int = 0: Number of spaces prepended to the first line.
subsequent_indent int = 0: Number of spaces prepended to every subsequent line.
break_on_hyphens bool = True: Allow breaks at hyphens in compound words.

**kwargs Any = {}

Examples:

text = pl.DataFrame({"x": ["A long sentence that exceeds the column width."]}).select(
    pl.col("x").ps_str.wrap(width=25)
)['x'].to_list()
text

['A long sentence that\nexceeds the column width.']
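
The parameters resemble those of Python's textwrap module, with the two indents given as a number of spaces. Assuming that resemblance holds (it is not confirmed by the text above), a per-string equivalent could look like this hypothetical helper:

```python
import textwrap

def wrap_one(text, width=80, initial_indent=0, subsequent_indent=0,
             break_on_hyphens=True):
    """Sketch: wrap a single string, with indents given as space counts."""
    return textwrap.fill(
        text,
        width=width,
        initial_indent=" " * initial_indent,
        subsequent_indent=" " * subsequent_indent,
        break_on_hyphens=break_on_hyphens,
    )

wrapped = wrap_one("A long sentence that exceeds the column width.", width=25)
print(repr(wrapped))  # 'A long sentence that\nexceeds the column width.'
```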

Expr.ps_str.trunc

Expr.ps_str.trunc(
    width=5,
    side='right',
    placeholder='…',
)

Truncate each string to fit within width characters.

Collapses whitespace and inserts placeholder where the text is cut.

width int = 5: Maximum length of the result, including the placeholder.
side Literal['right', 'left', 'center'] = 'right': Which side to truncate — ‘right’ (default), ‘left’, or ‘center’.
placeholder str = '…': String inserted where the text is cut.

Examples:

pl.DataFrame({"x": ["short", "a much longer string"]}).select(
    pl.col("x").ps_str.trunc(width=10)
)

shape: (2, 1)
┌────────────┐
│ x          │
│ ---        │
│ str        │
╞════════════╡
│ short      │
│ a much lo… │
└────────────┘
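
A plain-Python sketch of the right and left modes follows. The helper `trunc_one` is hypothetical; the center mode is left out because the text above does not say how the kept characters are split between the two ends.

```python
def trunc_one(text, width=5, side="right", placeholder="…"):
    """Sketch: collapse whitespace, then cut to width including placeholder."""
    text = " ".join(text.split())  # collapse runs of whitespace
    if len(text) <= width:
        return text
    keep = max(width - len(placeholder), 0)  # characters of text that survive
    if side == "right":
        return text[:keep] + placeholder
    if side == "left":
        return placeholder + text[len(text) - keep:]
    raise ValueError("only 'right' and 'left' are sketched here")

print(trunc_one("short", width=10))                              # short
print(trunc_one("a much longer string", width=10))               # a much lo…
print(trunc_one("a much longer string", width=10, side="left"))  # …er string
```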

Function helpers

ps.F

ps.F(
    fn,
)

Turn an arbitrary function into something callable on expressions, with real data.

fn is called exactly once, eagerly, with the complete data (as pl.Series) for its pl.Expr arguments — never a sample, dummy data, or a batch/slice. That means no return_dtype is needed: the output dtype is whatever fn actually produced. This is the right default for arbitrary vectorized functions (numpy, scipy, …) that don’t otherwise fit Polars’ expression API.

Only pl.Expr arguments (positional or keyword) are resolved against the data; any other argument (a plain string, number, …) is forwarded to fn unchanged — useful for a function’s non-column parameters, e.g. a cluster count.

The cost is an eager .collect() of fn’s pl.Expr arguments at the point ps.F(fn)(...) is resolved (via df.ps.with_columns/df.ps.select), the same tradeoff ps_chop and ps_enum already make for operations that need to see real data. Preceding .filter()/.select() calls are still pushed down, but nothing after this point can narrow the collect retroactively.

For a function that must stay fully lazy (e.g. inside a .over() or streaming pipeline) and can tolerate being called on batches/slices instead of the full column, use ps.B instead. For a plain Python function that only accepts scalars (not arrays), use ps.E.

fn Callable

Examples:

import polars as pl
import polarstation as ps
from scipy.cluster.hierarchy import fclusterdata

df = pl.DataFrame({"x": [0.0, 0.0, 10.0, 10.0], "y": [0.0, 0.2, 0.0, 0.2]})
df.ps.with_columns(
    cluster=ps.F(fclusterdata)(pl.concat_arr("x", "y"), t=2, criterion="maxclust")
)

shape: (4, 3)
┌──────┬─────┬─────────┐
│ x    ┆ y   ┆ cluster │
│ ---  ┆ --- ┆ ---     │
│ f64  ┆ f64 ┆ i32     │
╞══════╪═════╪═════════╡
│ 0.0  ┆ 0.0 ┆ 1       │
│ 0.0  ┆ 0.2 ┆ 1       │
│ 10.0 ┆ 0.0 ┆ 2       │
│ 10.0 ┆ 0.2 ┆ 2       │
└──────┴─────┴─────────┘

ps.B

ps.B(
    fn,
    return_dtype=None,
    is_elementwise=False,
    **map_batches_kwargs,
)

Turn an arbitrary vectorized function into something callable on expressions, lazily.

Thin wrapper around pl.map_batches: fn’s pl.Expr arguments (positional or keyword) are resolved to pl.Series, but — unlike ps.F — it may be called more than once, and on batches/slices rather than the complete column (e.g. under streaming execution, or once per group inside .over()/group_by().agg()). In exchange, the result stays fully lazy: no eager collect is forced at the call site. As with ps.F, any non-pl.Expr argument is forwarded to fn unchanged.

If return_dtype is left unset, Polars infers it by calling fn once with synthetic dummy data — this can raise for domain-restricted functions, or infer the wrong dtype. Prefer ps.F unless you specifically need laziness/streaming and can either supply return_dtype or tolerate that inference step.

fn Callable

return_dtype pl.DataTypeExpr | pl.DataType | None

is_elementwise bool = False

**map_batches_kwargs = {}

Examples:

import numpy as np
import polars as pl
import polarstation as ps

df = pl.DataFrame({"log_p": [-0.5, -3.0], "log_q": [-1.2, -0.4]})
df.lazy().with_columns(
    combined=ps.B(np.logaddexp, return_dtype=pl.Float64)(pl.col("log_p"), pl.col("log_q"))
).collect()

shape: (2, 3)
┌───────┬───────┬───────────┐
│ log_p ┆ log_q ┆ combined  │
│ ---   ┆ ---   ┆ ---       │
│ f64   ┆ f64   ┆ f64       │
╞═══════╪═══════╪═══════════╡
│ -0.5  ┆ -1.2  ┆ -0.096814 │
│ -3.0  ┆ -0.4  ┆ -0.328355 │
└───────┴───────┴───────────┘
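
Whether a function tolerates batches depends on whether its result for one row uses other rows. The plain-Python sketch below (lists standing in for pl.Series, all names hypothetical) contrasts a whole-column function with an elementwise one when the data arrives in two slices.

```python
def center(xs):
    """Depends on the whole input: subtracts the mean of what it is given."""
    m = sum(xs) / len(xs)
    return [x - m for x in xs]

def double(xs):
    """Elementwise: each output depends only on its own input."""
    return [2 * x for x in xs]

data = [1.0, 2.0, 3.0, 4.0]

# Called once on the complete column, as ps.F guarantees.
whole = center(data)
# Called per slice, as can happen with ps.B; each slice gets its own mean.
batched = center(data[:2]) + center(data[2:])

print(whole)    # [-1.5, -0.5, 0.5, 1.5]
print(batched)  # [-0.5, 0.5, -0.5, 0.5]
print(double(data) == double(data[:2]) + double(data[2:]))  # True
```

Functions like `double` are safe under either wrapper; functions like `center` give slice-dependent results when batched.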

ps.E

ps.E(
    fn,
    return_dtype=None,
    **map_elements_kwargs,
)

Turn a scalar (non-vectorized) Python function into something callable on expressions.

fn is called once per row with plain Python scalars — the multi-argument equivalent of pl.Expr.map_elements, built on pl.struct(...).map_elements(...). As with ps.F/ps.B, only pl.Expr arguments (positional or keyword) become per-row values; any other argument is forwarded to fn unchanged. Note that skip_nulls (defaults to True, can be overridden via **map_elements_kwargs) only skips a row when the entire struct is null, which struct values built from columns essentially never are — a null in a single argument still reaches fn as None, so fn must be able to handle that itself if any argument column has nulls.

Unlike ps.F/ps.B, there is no vectorized fast path here — fn runs once per row. Only reach for ps.E when fn genuinely cannot operate on whole arrays at once (e.g. it calls into a scalar-only library). For anything that accepts numpy arrays or pl.Series directly, prefer ps.F.

fn Callable

return_dtype pl.DataTypeExpr | pl.DataType | None = None

**map_elements_kwargs = {}

Examples:

import polars as pl
import polarstation as ps

def levenshtein(a, b):
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i] + [0] * len(b)
        for j, cb in enumerate(b, 1):
            curr[j] = min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb))
        prev = curr
    return prev[-1]

df = pl.DataFrame({"typed": ["aplpe", "bananna"], "correct": ["apple", "banana"]})
df.with_columns(
    dist=ps.E(levenshtein, return_dtype=pl.Int64)(pl.col("typed"), pl.col("correct"))
)

shape: (2, 3)
┌─────────┬─────────┬──────┐
│ typed   ┆ correct ┆ dist │
│ ---     ┆ ---     ┆ ---  │
│ str     ┆ str     ┆ i64  │
╞═════════╪═════════╪══════╡
│ aplpe   ┆ apple   ┆ 2    │
│ bananna ┆ banana  ┆ 1    │
└─────────┴─────────┴──────┘
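
Because a null in any single argument column reaches fn as None, a scalar function that cannot handle None needs a guard. One possible approach is a small decorator that short-circuits to None. The names none_safe and char_overlap below are hypothetical illustrations, not part of polarstation.

```python
from functools import wraps

def none_safe(fn):
    """Return None without calling fn if any argument is None."""
    @wraps(fn)
    def wrapper(*args, **kwargs):
        if any(a is None for a in args) or any(v is None for v in kwargs.values()):
            return None
        return fn(*args, **kwargs)
    return wrapper

@none_safe
def char_overlap(a, b):
    # Hypothetical scalar-only function: count of distinct shared characters.
    return len(set(a) & set(b))

print(char_overlap("apple", "aplpe"))  # 4
print(char_overlap(None, "apple"))     # None
```

Assuming ps.E forwards per-row values as described above, the wrapped function could then be passed in place of the bare one, e.g. ps.E(none_safe(levenshtein), return_dtype=pl.Int64).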

ps.format

ps.format(
    template,
    *args,
    **kwargs,
)

Format columns into a string, the way str.format formats values.

template uses the same {field:spec} syntax as str.format (built on the same string.Formatter parser). Any field whose value is a pl.Expr is formatted per-row via Python’s own format(value, spec). A field whose value is a plain Python value (not a pl.Expr) is formatted once, immediately, like ordinary str.format.

template str

*args = ()

**kwargs = {}

Examples:

import polars as pl
import polarstation as ps

df = pl.DataFrame({"err": [0.5, 1.25, 12.0]})
df.with_columns(msg=ps.format("error={:.2f}", pl.col("err")))

shape: (3, 2)
┌──────┬─────────────┐
│ err  ┆ msg         │
│ ---  ┆ ---         │
│ f64  ┆ str         │
╞══════╪═════════════╡
│ 0.5  ┆ error=0.50  │
│ 1.25 ┆ error=1.25  │
│ 12.0 ┆ error=12.00 │
└──────┴─────────────┘

A single expression can also be formatted fluently via Expr.ps_str.format:

df.ps.with_columns(msg=pl.col("err").ps_str.format("error={:.2f}"))

shape: (3, 2)
┌──────┬─────────────┐
│ err  ┆ msg         │
│ ---  ┆ ---         │
│ f64  ┆ str         │
╞══════╪═════════════╡
│ 0.5  ┆ error=0.50  │
│ 1.25 ┆ error=1.25  │
│ 12.0 ┆ error=12.00 │
└──────┴─────────────┘

For several expressions in one template, ps.fmt_col(...) marks the spot inside a real f-string — the format spec (:.2f below) is written exactly where it would be for any other value:

df2 = pl.DataFrame({"err": [0.5, 1.25], "n": [3, 12]})
df2.with_columns(
    msg=ps.format(f"error={ps.fmt_col('err'):.2f} (n={ps.fmt_col('n')})")
)

shape: (2, 3)
┌──────┬─────┬───────────────────┐
│ err  ┆ n   ┆ msg               │
│ ---  ┆ --- ┆ ---               │
│ f64  ┆ i64 ┆ str               │
╞══════╪═════╪═══════════════════╡
│ 0.5  ┆ 3   ┆ error=0.50 (n=3)  │
│ 1.25 ┆ 12  ┆ error=1.25 (n=12) │
└──────┴─────┴───────────────────┘
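
To make the {field:spec} handling concrete, the standard-library pieces named above can be exercised on their own. This sketch uses only string.Formatter and the built-in format; it illustrates the template syntax and per-value formatting, not polarstation's internal code, and the sample rows are made up for the illustration.

```python
from string import Formatter

template = "error={:.2f} (n={n})"

# Formatter.parse yields (literal_text, field_name, format_spec, conversion).
fields = list(Formatter().parse(template))
for literal, field, spec, conversion in fields:
    print(repr(literal), repr(field), repr(spec), repr(conversion))

# Formatting one value with its spec, as would happen for each row:
print(format(0.5, ".2f"))   # 0.50
print(format(12.0, ".2f"))  # 12.00

# Plain str.format applied row by row gives the same strings.
rows = [(0.5, 3), (1.25, 12)]
rendered = [template.format(err, n=n) for err, n in rows]
print(rendered)  # ['error=0.50 (n=3)', 'error=1.25 (n=12)']
```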

ps.fmt_col

ps.fmt_col(
    column,
)

Mark a column for embedding inside a real f-string that is then passed to ps.format(...).

Shorthand for FmtPlaceholder(pl.col(column)) when given a string, or FmtPlaceholder(column) directly when given a pl.Expr — see ps.format for the full explanation and examples.

column IntoExpr

Examples:

import polars as pl
import polarstation as ps

df = pl.DataFrame({"err": [0.5, 1.25], "n": [3, 12]})
df.with_columns(
    msg=ps.format(f"error={ps.fmt_col('err'):.2f} (n={ps.fmt_col('n')})")
)

shape: (2, 3)
┌──────┬─────┬───────────────────┐
│ err  ┆ n   ┆ msg               │
│ ---  ┆ --- ┆ ---               │
│ f64  ┆ i64 ┆ str               │
╞══════╪═════╪═══════════════════╡
│ 0.5  ┆ 3   ┆ error=0.50 (n=3)  │
│ 1.25 ┆ 12  ┆ error=1.25 (n=12) │
└──────┴─────┴───────────────────┘
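
An f-string is evaluated eagerly, so a marker object has to survive that evaluation for a later formatting step to find it. One way this could work is through Python's __format__ hook, which receives the format spec written after the colon. The Placeholder class below is a hypothetical stand-in written for this illustration; polarstation's actual FmtPlaceholder may work differently.

```python
class Placeholder:
    """Hypothetical marker; not polarstation's FmtPlaceholder."""

    def __init__(self, name):
        self.name = name

    def __format__(self, spec):
        # Re-emit a str.format-style field so the spec is preserved.
        return "{" + self.name + (":" + spec if spec else "") + "}"

# The f-string calls __format__ on each marker with its spec.
template = f"error={Placeholder('err'):.2f} (n={Placeholder('n')})"
print(template)                       # error={err:.2f} (n={n})
print(template.format(err=0.5, n=3))  # error=0.50 (n=3)
```

The point of the sketch is only that the spec (:.2f) is handed to the marker rather than applied immediately, which is why it can be written where it would be for any other value.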

Internals

FrameExpr

FrameExpr(
    col_expr,
    resolver,
)

An expression that requires a LazyFrame context to resolve into a list of pl.Expr.

A plain pl.Expr is insufficient for operations like ps_enum.make() or ps_chop.chop() because Polars needs to know the output dtype (e.g. the exact pl.Enum([...]) category list) at plan-construction time — before any data is seen. FrameExpr defers that resolution to a two-phase execution model:

Phase 1 (peek): df.ps.with_columns calls resolve(lf) with the current LazyFrame. The resolver runs a small aggregation (e.g. unique().sort() for category discovery, a handful of quantiles for binning) and collects it. Because the resolver receives the full lazy plan up to that point, any preceding .filter() or .select() calls are already embedded and Polars’ predicate/projection pushdown applies, so only the relevant rows and columns are scanned.

Phase 2 (expression): The resolver uses the aggregation result to construct a concrete pl.Expr with all dtype information baked in (e.g. pl.col("x").cast(pl.Enum(["a", "b", "c"]))). This expression is inserted back into the lazy plan and executed lazily together with all subsequent operations.

col_expr pl.Expr

resolver Callable[[pl.LazyFrame], list[pl.Expr]]