chop — label and struct reference

chop(breaks, ...) bins a column at explicit breakpoints. This notebook shows how the label text varies across the four supported dtype families and with the three key parameters: left_closed, extend, and fmt.

All examples use a single break so the output is always two bins.

Dtype families

Float (continuous)

Data: [1.0, 3.0, 7.0, 10.0], break at 5.0.

Float labels treat the break as a mathematical boundary; the interval that includes the break value is the one closed at that end.

`extend`	`left_closed`	lower bin	upper bin
`True` (default)	`True` (default)	`[-∞, 5)`	`[5, ∞)`
`True`	`False`	`(-∞, 5]`	`(5, ∞]`
`False`	`True`	`[1, 5)`	`[5, 10]`
`False`	`False`	`[1, 5]`	`(5, 10]`

Code

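import polars as pl  # `show` and the `ps_chop` namespace are assumed to be set up in an earlier cell

df_f = pl.DataFrame({"x": [1.0, 3.0, 7.0, 10.0]})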
show(df_f, "x", pl.col("x").ps_chop.chop([5.0], extend=True, left_closed=True))

Enum(categories=['[-∞, 5)', '[5, ∞)'])

show(df_f, "x", pl.col("x").ps_chop.chop([5.0], extend=True, left_closed=False))

Enum(categories=['(-∞, 5]', '(5, ∞]'])

show(df_f, "x", pl.col("x").ps_chop.chop([5.0], extend=False, left_closed=True))

Enum(categories=['[1, 5)', '[5, 10]'])

show(df_f, "x", pl.col("x").ps_chop.chop([5.0], extend=False, left_closed=False))

Enum(categories=['[1, 5]', '(5, 10]'])

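extend=True (default) makes the outermost edges run to ±∞ regardless of the data range. extend=False closes them at the observed min / max.

With extend=False, the outermost finite end is always closed so that the data minimum and maximum fall inside a bin: with left_closed=True the last interval ends in ] (the data maximum), and with left_closed=False the first interval starts with [ (the data minimum). Infinite bounds keep the bracket implied by left_closed, which is why the table shows [-∞ and ∞].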
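The four float rows can be reproduced from one rule: the brackets come from left_closed, and an outermost end is closed when it is finite. The following is a minimal pure-Python sketch of that rule for a single break. It is not ps_chop's implementation; `float_labels` and `fmt_bound` are hypothetical names introduced only for illustration.

```python
import math


def fmt_bound(v):
    """Render a bound the way the labels above do: '5' for 5.0, '∞' for infinity."""
    if math.isinf(v):
        return "-∞" if v < 0 else "∞"
    return format(v, "g")


def float_labels(data, brk, extend=True, left_closed=True):
    """Return the two bin labels for one break (illustrative sketch only)."""
    lo = -math.inf if extend else min(data)
    hi = math.inf if extend else max(data)
    edges = [lo, brk, hi]
    labels = []
    for i in range(2):
        a, b = edges[i], edges[i + 1]
        left, right = ("[", ")") if left_closed else ("(", "]")
        # Close the outermost end when it is finite, so min/max fall inside a bin.
        if left_closed and i == 1 and math.isfinite(b):
            right = "]"
        if not left_closed and i == 0 and math.isfinite(a):
            left = "["
        labels.append(f"{left}{fmt_bound(a)}, {fmt_bound(b)}{right}")
    return labels


data = [1.0, 3.0, 7.0, 10.0]
for extend in (True, False):
    for left_closed in (True, False):
        print(extend, left_closed, float_labels(data, 5.0, extend, left_closed))
```

The loop prints the same four label pairs as the table, in the same row order.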

Integer (discrete)

Data: [1, 3, 7, 10] (Int32), break at 5.

Integer labels convert each half-open cut interval to the equivalent closed discrete range, and use {x} for single-element bins. Infinite ends are written (-∞ and +∞), always with round brackets.

`extend`	`left_closed`	lower bin	upper bin
`True` (default)	`True` (default)	`(-∞, 4]`	`[5, +∞)`
`True`	`False`	`(-∞, 5]`	`[6, +∞)`
`False`	`True`	`[1, 4]`	`[5, 10]`
`False`	`False`	`[1, 5]`	`[6, 10]`

Code

df_i = pl.DataFrame({"x": pl.Series([1, 3, 7, 10], dtype=pl.Int32)})
show(df_i, "x", pl.col("x").ps_chop.chop([5], extend=True, left_closed=True))

Enum(categories=['(-∞, 4]', '[5, +∞)'])

show(df_i, "x", pl.col("x").ps_chop.chop([5], extend=True, left_closed=False))

Enum(categories=['(-∞, 5]', '[6, +∞)'])

show(df_i, "x", pl.col("x").ps_chop.chop([5], extend=False, left_closed=True))

Enum(categories=['[1, 4]', '[5, 10]'])

show(df_i, "x", pl.col("x").ps_chop.chop([5], extend=False, left_closed=False))

Enum(categories=['[1, 5]', '[6, 10]'])

With left_closed=True the break value (5) starts the upper bin, so the lower bin’s last integer is 4. With left_closed=False the break belongs to the lower bin (≤ 5), so the upper bin starts at 6.
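The conversion from a half-open cut to a closed integer range is a shift by one on whichever side excludes the break. The sketch below reproduces the four integer rows for a single break. It is an illustration under that assumption, not the library's code; `int_labels` is a hypothetical name, and the `{x}` branch follows the single-element notation described above.

```python
def int_labels(data, brk, extend=True, left_closed=True):
    """Return the two closed-range labels for one integer break (sketch only)."""
    # left_closed=True: the break starts the upper bin, so the lower bin ends at brk - 1.
    # left_closed=False: the break ends the lower bin, so the upper bin starts at brk + 1.
    last_lower = brk - 1 if left_closed else brk
    first_upper = last_lower + 1
    lo = None if extend else min(data)  # None stands for an infinite bound
    hi = None if extend else max(data)

    def label(a, b):
        if a is not None and a == b:
            return "{" + str(a) + "}"  # single-element bin
        left = "(-∞" if a is None else f"[{a}"
        right = "+∞)" if b is None else f"{b}]"
        return f"{left}, {right}"

    return [label(lo, last_lower), label(first_upper, hi)]


data = [1, 3, 7, 10]
for extend in (True, False):
    for left_closed in (True, False):
        print(extend, left_closed, int_labels(data, 5, extend, left_closed))
```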

String / Enum (categorical)

Data: ["cat", "dog", "fish", "gull"], break at "dog".

extend has no effect on categorical columns — the outermost labels always use the first and last observed category. Breaks are matched against the sorted category list (or the Enum’s defined order).

`left_closed`	lower bin	upper bin
`True` (default)	`{cat}`	`[dog, gull]`
`False`	`[cat, dog]`	`[fish, gull]`

Code

df_s = pl.DataFrame({"x": ["cat", "dog", "fish", "gull"]})
show(df_s, "x", pl.col("x").ps_chop.chop(["dog"], left_closed=True))

Enum(categories=['{cat}', '[dog, gull]'])

show(df_s, "x", pl.col("x").ps_chop.chop(["dog"], left_closed=False))

Enum(categories=['[cat, dog]', '[fish, gull]'])

With left_closed=True the break (“dog”) opens the upper bin, so the lower bin contains only “cat” → single-element notation {cat}. With left_closed=False the break belongs to the lower bin, so lower is [cat, dog] and upper starts at the next category “fish”.

Enum columns work identically but the category order comes from the Enum definition rather than alphabetical sort:

Code

df_e = pl.DataFrame({"x": pl.Series(
    ["low", "medium", "high", "medium"],
    dtype=pl.Enum(["low", "medium", "high"])   # non-alphabetical order
)})
show(df_e, "x", pl.col("x").ps_chop.chop(["medium"]))

Enum(categories=['{low}', '[medium, high]'])
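For categorical columns the split can be pictured as slicing an ordered category list at the break. The sketch below assumes a single break that leaves at least one category on each side. `categorical_labels` is a hypothetical helper, not part of ps_chop, and the caller supplies the order (sorted strings, or the Enum's definition order).

```python
def categorical_labels(categories, brk, left_closed=True):
    """Split an ordered category list at `brk` and label both halves (sketch only)."""
    i = categories.index(brk)
    # left_closed=True: the break opens the upper bin.
    # left_closed=False: the break closes the lower bin.
    split = i if left_closed else i + 1
    groups = [categories[:split], categories[split:]]
    if not all(groups):
        raise ValueError("break must leave at least one category on each side")

    def label(group):
        if len(group) == 1:
            return "{" + group[0] + "}"  # single-element notation
        return f"[{group[0]}, {group[-1]}]"

    return [label(g) for g in groups]


# Plain strings: order comes from sorting the observed values.
animals = sorted(["cat", "dog", "fish", "gull"])
print(categorical_labels(animals, "dog", left_closed=True))
print(categorical_labels(animals, "dog", left_closed=False))

# Enum: order comes from the definition, not the alphabet.
print(categorical_labels(["low", "medium", "high"], "medium"))
```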

Temporal (Date / Datetime / Duration)

Data: [2020-01-01, 2020-04-01, 2020-09-01, 2020-12-01] (Date), break at 2020-07-01.

extend has no effect on temporal columns — bounds are always the observed min / max. Labels use str() of each value by default.

`left_closed`	lower bin	upper bin
`True` (default)	`[2020-01-01, 2020-07-01)`	`[2020-07-01, 2020-12-01]`
`False`	`[2020-01-01, 2020-07-01]`	`(2020-07-01, 2020-12-01]`

Code

import datetime

D = datetime.date
df_d = pl.DataFrame({"d": [D(2020,1,1), D(2020,4,1), D(2020,9,1), D(2020,12,1)]})
show(df_d, "d", pl.col("d").ps_chop.chop([D(2020,7,1)], left_closed=True))

Enum(categories=['[2020-01-01, 2020-07-01)', '[2020-07-01, 2020-12-01]'])

show(df_d, "d", pl.col("d").ps_chop.chop([D(2020,7,1)], left_closed=False))

Enum(categories=['[2020-01-01, 2020-07-01]', '(2020-07-01, 2020-12-01]'])
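Temporal labels behave like the float case with the outer bounds pinned to the observed minimum and maximum. The sketch below covers a single Date break under that reading. `date_labels` is a hypothetical name, and its `fmt` argument only mirrors the idea of a per-bound callable with `str` as the default.

```python
import datetime


def date_labels(dates, brk, left_closed=True, fmt=str):
    """Return the two bin labels for one date break (illustrative sketch only)."""
    lo, hi = min(dates), max(dates)  # bounds are always the observed min / max
    if left_closed:
        # Break opens the upper bin; the final bound is closed.
        return [f"[{fmt(lo)}, {fmt(brk)})", f"[{fmt(brk)}, {fmt(hi)}]"]
    # Break closes the lower bin; the first bound is closed.
    return [f"[{fmt(lo)}, {fmt(brk)}]", f"({fmt(brk)}, {fmt(hi)}]"]


D = datetime.date
dates = [D(2020, 1, 1), D(2020, 4, 1), D(2020, 9, 1), D(2020, 12, 1)]
print(date_labels(dates, D(2020, 7, 1), left_closed=True))
print(date_labels(dates, D(2020, 7, 1), left_closed=False))
```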

`fmt` — custom label formatting

Float / Integer: format-spec string or callable

df_f = pl.DataFrame({"x": [1.0, 3.0, 7.0, 10.0]})
# Format-spec string
show(df_f, "x", pl.col("x").ps_chop.chop([5.0], fmt=".2f"))

Enum(categories=['[-∞, 5.00)', '[5.00, ∞)'])

# Callable
show(df_f, "x", pl.col("x").ps_chop.chop([5.0], fmt=lambda v: f"${v:.0f}"))

Enum(categories=['[-∞, $5)', '[$5, ∞)'])
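Both outputs above leave the infinite bounds untouched, which suggests fmt is applied only to finite values. One way to accept either a format-spec string or a callable is to normalize them into a single function first. The sketch below assumes that behaviour; `make_formatter` is a hypothetical helper, not a documented ps_chop function.

```python
import math


def make_formatter(fmt=None):
    """Turn a format-spec string or a callable into one bound formatter (sketch only)."""
    if fmt is None:
        base = lambda v: format(v, "g")  # default: shortest general form
    elif callable(fmt):
        base = fmt
    else:
        base = lambda v: format(v, fmt)  # e.g. ".2f"

    def formatter(v):
        # Infinite bounds bypass fmt, matching the labels shown above.
        if isinstance(v, float) and math.isinf(v):
            return "-∞" if v < 0 else "∞"
        return base(v)

    return formatter


two_dp = make_formatter(".2f")
dollars = make_formatter(lambda v: f"${v:.0f}")
print(f"[{two_dp(-math.inf)}, {two_dp(5.0)})")    # [-∞, 5.00)
print(f"[{dollars(5.0)}, {dollars(math.inf)})")   # [$5, ∞)
```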

String / Enum: callable applied to each category name

df_s = pl.DataFrame({"x": ["cat", "dog", "fish", "gull"]})
show(df_s, "x", pl.col("x").ps_chop.chop(["dog"], fmt=str.upper))

Enum(categories=['{CAT}', '[DOG, GULL]'])

Temporal: callable applied to each bound value

D = datetime.date
df_d = pl.DataFrame({"d": [D(2020,1,1), D(2020,4,1), D(2020,9,1), D(2020,12,1)]})
show(df_d, "d", pl.col("d").ps_chop.chop([D(2020,7,1)], fmt=lambda d: d.strftime("%b %Y")))

Enum(categories=['[Jan 2020, Jul 2020)', '[Jul 2020, Dec 2020]'])

Parameter summary

Parameter	Float	Integer	String / Enum	Temporal
`left_closed`	controls which side of break is open	same, but label adjusts to show discrete range	controls whether break belongs to lower or upper bin	controls which side is open
`extend`	`True` → ±∞ outer labels; `False` → data min/max	same	no effect (always data min/max)	no effect (always data min/max)
`fmt`	format-spec string or `Callable[[float], str]`	same	`Callable[[str], str]`	`Callable[[value], str]`

Differences from santoku

ps_chop is inspired by R’s santoku package but diverges in several ways.

extend semantics — santoku’s default is extend = NULL, which uses the data range for outer bin boundaries — equivalent to our extend=False. Santoku’s extend = FALSE turns values outside any break into null; we have no equivalent mode.

Last interval — santoku’s close_end parameter controls whether the final interval is closed on both sides. We always close the last interval when its bound is finite, matching santoku’s default close_end = TRUE.

Integer columns — santoku uses half-open [a, b) notation for all numeric types. We use fully-closed [a, b] notation for integer columns, converting each half-open boundary to the integer it contains (e.g. [lo, 5) becomes [lo, 4]).

String / Enum columns — santoku treats character vectors as unordered and errors on chop; quantile-based functions warn. We support string and Enum columns natively, with labels like [apple, cherry] or {singleton}.

drop parameter — santoku can drop empty factor levels from the result. We always preserve all Enum categories; empty bins appear as unused levels.

Warnings — santoku warns when quantile boundaries collapse on identical data, or when ties force groups larger than n in chop_n. We silently handle both cases.
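To make the drop difference concrete: when every value falls in one bin, the other label still exists as a level with a count of zero. The sketch below tabulates labels in plain Python and shows how a caller could discard unused levels afterwards. `level_counts` and its `drop` flag are hypothetical and are not ps_chop parameters.

```python
from collections import Counter


def level_counts(levels, assigned, drop=False):
    """Count rows per level, keeping zero-count levels unless drop=True (sketch only)."""
    counts = Counter(assigned)
    table = {level: counts.get(level, 0) for level in levels}
    if drop:
        # Mimic dropping empty levels after the fact.
        table = {level: n for level, n in table.items() if n > 0}
    return table


levels = ["[-∞, 5)", "[5, ∞)"]               # all categories defined by the break
assigned = ["[-∞, 5)", "[-∞, 5)", "[-∞, 5)"]  # e.g. data [1.0, 3.0, 4.0], break at 5.0
print(level_counts(levels, assigned))             # {'[-∞, 5)': 3, '[5, ∞)': 0}
print(level_counts(levels, assigned, drop=True))  # {'[-∞, 5)': 3}
```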
