df_f = pl.DataFrame({"x": [1.0, 3.0, 7.0, 10.0]})
show(df_f, "x", pl.col("x").ps_chop.chop([5.0], extend=True, left_closed=True))
Enum(categories=['[-∞, 5)', '[5, ∞)'])
chop(breaks, ...) bins a column at explicit breakpoints. This notebook shows how the label text changes across the four supported dtype families and the three key parameters: left_closed, extend, and fmt.
All examples use a single break so the output is always two bins.
Data: [1.0, 3.0, 7.0, 10.0], break at 5.0.
Float labels treat the break as a mathematical boundary; the interval that includes the break value is the one closed at that end.
| extend | left_closed | lower bin | upper bin |
|---|---|---|---|
| True (default) | True (default) | [-∞, 5) | [5, ∞) |
| True | False | (-∞, 5] | (5, ∞] |
| False | True | [1, 5) | [5, 10] |
| False | False | [1, 5] | (5, 10] |
Enum(categories=['[-∞, 5)', '[5, ∞)'])
Enum(categories=['(-∞, 5]', '(5, ∞]'])
Enum(categories=['[1, 5)', '[5, 10]'])
extend=True (default) makes the outermost edges run to ±∞ regardless of the data range. extend=False closes them at the observed min / max.
With extend=False and left_closed=True, the last interval uses ] because its upper bound is finite (the data maximum).
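The bracket rules in the float table can be reproduced with a few lines of plain Python. This is only a sketch that mimics the labels shown above for a single break; `float_labels` is a hypothetical helper written for illustration and is not part of ps_chop, whose internals may differ.

```python
import math

def float_labels(values, brk, extend=True, left_closed=True):
    """Hypothetical helper (not part of ps_chop): labels for one float break."""
    lo = -math.inf if extend else min(values)
    hi = math.inf if extend else max(values)

    def show(v):
        if math.isinf(v):
            return "-∞" if v < 0 else "∞"
        return f"{v:g}"  # 5.0 -> "5"

    if left_closed:
        # Break belongs to the upper bin; a finite outer maximum is closed with "]".
        lower = f"[{show(lo)}, {show(brk)})"
        upper = f"[{show(brk)}, {show(hi)}" + ("]" if math.isfinite(hi) else ")")
    else:
        # Break belongs to the lower bin; a finite outer minimum is closed with "[".
        lower = ("[" if math.isfinite(lo) else "(") + f"{show(lo)}, {show(brk)}]"
        upper = f"({show(brk)}, {show(hi)}]"
    return [lower, upper]

data = [1.0, 3.0, 7.0, 10.0]
for extend in (True, False):
    for left_closed in (True, False):
        print(extend, left_closed, float_labels(data, 5.0, extend, left_closed))
```

The four printed rows correspond to the four rows of the table, in the same order.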
Data: [1, 3, 7, 10] (Int32), break at 5.
Integer labels convert each half-open cut interval to the equivalent closed discrete range, and use {x} for single-element bins.
| extend | left_closed | lower bin | upper bin |
|---|---|---|---|
| True (default) | True (default) | (-∞, 4] | [5, +∞) |
| True | False | (-∞, 5] | [6, +∞) |
| False | True | [1, 4] | [5, 10] |
| False | False | [1, 5] | [6, 10] |
Enum(categories=['(-∞, 4]', '[5, +∞)'])
Enum(categories=['(-∞, 5]', '[6, +∞)'])
Enum(categories=['[1, 4]', '[5, 10]'])
With left_closed=True the break value (5) starts the upper bin, so the lower bin’s last integer is 4. With left_closed=False the break belongs to the lower bin (≤ 5), so the upper bin starts at 6.
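The shift from a half-open cut to a closed integer range amounts to picking the last integer of the lower bin and adding one for the upper bin. The sketch below illustrates that arithmetic for a single break; `int_labels` is a hypothetical name, and the code only imitates the labels in the table rather than showing how ps_chop computes them.

```python
def int_labels(values, brk, extend=True, left_closed=True):
    """Hypothetical helper (not part of ps_chop): labels for one integer break."""
    # left_closed=True: break starts the upper bin, so the lower bin ends at brk - 1.
    # left_closed=False: break ends the lower bin, so the upper bin starts at brk + 1.
    lower_hi = brk - 1 if left_closed else brk
    upper_lo = lower_hi + 1

    def closed(a, b):
        # Single-element ranges use {x} notation.
        return f"{{{a}}}" if a == b else f"[{a}, {b}]"

    if extend:
        return [f"(-∞, {lower_hi}]", f"[{upper_lo}, +∞)"]
    return [closed(min(values), lower_hi), closed(upper_lo, max(values))]

data = [1, 3, 7, 10]
print(int_labels(data, 5))                                  # ['(-∞, 4]', '[5, +∞)']
print(int_labels(data, 5, extend=False, left_closed=False)) # ['[1, 5]', '[6, 10]']
print(int_labels([4, 7, 10], 5, extend=False))              # ['{4}', '[5, 10]']
```

The last call uses different data (an assumption made for the example) to show the single-element case.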
Data: ["cat", "dog", "fish", "gull"], break at "dog".
extend has no effect on categorical columns — the outermost labels always use the first and last observed category. Breaks are matched against the sorted category list (or the Enum’s defined order).
| left_closed | lower bin | upper bin |
|---|---|---|
| True (default) | {cat} | [dog, gull] |
| False | [cat, dog] | [fish, gull] |
With left_closed=True the break (“dog”) opens the upper bin, so the lower bin contains only “cat” → single-element notation {cat}. With left_closed=False the break belongs to the lower bin, so lower is [cat, dog] and upper starts at the next category “fish”.
Enum columns work identically but the category order comes from the Enum definition rather than alphabetical sort:
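The effect of category order on the labels can be sketched in plain Python. `category_labels` below is a hypothetical helper, not the ps_chop API: it splits an ordered category list at the break and labels each side, using alphabetical order unless an explicit order (standing in for an Enum definition) is passed. It assumes the break is one of the categories and that neither side ends up empty.

```python
def category_labels(values, brk, left_closed=True, order=None):
    """Hypothetical helper (not part of ps_chop): labels for one categorical break."""
    # Alphabetical order by default; an explicit list plays the role of an Enum's order.
    cats = list(order) if order is not None else sorted(set(values))
    i = cats.index(brk)
    # left_closed=True puts the break in the upper group, otherwise in the lower group.
    split = i if left_closed else i + 1
    lower, upper = cats[:split], cats[split:]

    def label(group):
        if len(group) == 1:
            return f"{{{group[0]}}}"
        return f"[{group[0]}, {group[-1]}]"

    return [label(lower), label(upper)]

data = ["cat", "dog", "fish", "gull"]
print(category_labels(data, "dog"))                     # ['{cat}', '[dog, gull]']
print(category_labels(data, "dog", left_closed=False))  # ['[cat, dog]', '[fish, gull]']

# A made-up, non-alphabetical order to imitate an Enum definition.
enum_order = ["gull", "fish", "dog", "cat"]
print(category_labels(data, "dog", order=enum_order))   # ['[gull, fish]', '[dog, cat]']
```

With the reversed order, the same break produces different bins because "dog" now sits third in the sequence rather than second.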
Data: four dates spanning 2020, break at 2020-07-01.
extend has no effect on temporal columns — bounds are always the observed min / max. Labels use str() of each value by default.
| left_closed | lower bin | upper bin |
|---|---|---|
| True (default) | [2020-01-01, 2020-07-01) | [2020-07-01, 2020-12-01] |
| False | [2020-01-01, 2020-07-01] | (2020-07-01, 2020-12-01] |
Enum(categories=['[2020-01-01, 2020-07-01)', '[2020-07-01, 2020-12-01]'])
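The temporal labels follow the same bracket rules as floats with extend=False, with each bound rendered through str(). A minimal sketch, assuming a single break and using a hypothetical `date_labels` helper (not part of ps_chop); the two middle dates are invented for the example, since only the minimum and maximum are visible in the labels above.

```python
from datetime import date

def date_labels(values, brk, left_closed=True):
    """Hypothetical helper (not part of ps_chop): labels for one date break."""
    # Bounds are always the observed min / max; f-strings render dates via str().
    lo, hi = min(values), max(values)
    if left_closed:
        return [f"[{lo}, {brk})", f"[{brk}, {hi}]"]
    return [f"[{lo}, {brk}]", f"({brk}, {hi}]"]

# Middle two dates are illustrative assumptions.
dates = [date(2020, 1, 1), date(2020, 4, 1), date(2020, 9, 1), date(2020, 12, 1)]
print(date_labels(dates, date(2020, 7, 1)))
print(date_labels(dates, date(2020, 7, 1), left_closed=False))
```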
fmt: custom label formatting
Enum(categories=['[-∞, 5.00)', '[5.00, ∞)'])
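One way to picture how a format-spec string and a callable could both be accepted is a small dispatcher applied to each finite bound. This is a sketch under assumptions: `apply_fmt` is a hypothetical helper, the spec ".2f" is a guess at what produced the output above, and leaving infinite bounds unformatted is inferred from that output rather than confirmed.

```python
import math
from typing import Callable, Optional, Union

Fmt = Optional[Union[str, Callable[[float], str]]]

def apply_fmt(value: float, fmt: Fmt = None) -> str:
    """Hypothetical helper (not part of ps_chop): render one interval bound."""
    if math.isinf(value):
        # Assumption: infinite bounds bypass fmt and keep the ∞ symbol.
        return "-∞" if value < 0 else "∞"
    if fmt is None:
        return f"{value:g}"
    if callable(fmt):
        return fmt(value)
    # Otherwise treat fmt as a Python format-spec string such as ".2f".
    return format(value, fmt)

bounds = (-math.inf, 5.0)
print(f"[{apply_fmt(bounds[0], '.2f')}, {apply_fmt(bounds[1], '.2f')})")  # [-∞, 5.00)
print(apply_fmt(5.0, lambda v: f"{v:.0f} units"))                         # 5 units
```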
| Parameter | Float | Integer | String / Enum | Temporal |
|---|---|---|---|---|
| left_closed | controls which side of break is open | same, but label adjusts to show discrete range | controls whether break belongs to lower or upper bin | controls which side is open |
| extend | True → ±∞ outer labels; False → data min/max | same | no effect (always data min/max) | no effect (always data min/max) |
| fmt | format-spec string or Callable[[float], str] | same | Callable[[str], str] | Callable[[value], str] |
ps_chop is inspired by R’s santoku package but diverges in several ways.
extend semantics — santoku’s default is extend = NULL, which uses the data range for outer bin boundaries — equivalent to our extend=False. Santoku’s extend = FALSE turns values outside any break into null; we have no equivalent mode.
Last interval — santoku’s close_end parameter controls whether the final interval is closed on both sides. We always close the last interval when its bound is finite, matching santoku’s default close_end = TRUE.
Integer columns — santoku uses half-open [a, b) notation for all numeric types. We use fully-closed [a, b] notation for integer columns, converting each half-open boundary to the integer it contains (e.g. [lo, 5) becomes [lo, 4]).
String / Enum columns — santoku treats character vectors as unordered and errors on chop; quantile-based functions warn. We support string and Enum columns natively, with labels like [apple, cherry] or {singleton}.
drop parameter — santoku can drop empty factor levels from the result. We always preserve all Enum categories; empty bins appear as unused levels.
Warnings — santoku warns when quantile boundaries collapse on identical data, or when ties force groups larger than n in chop_n. We silently handle both cases.