chop — label and struct reference

chop(breaks, ...) bins a column at explicit breakpoints. This notebook shows how the label text changes across the four supported dtype families and the three key parameters: left_closed, extend, and fmt.

All examples use a single break so the output is always two bins.

Dtype families

Float (continuous)

Data: [1.0, 3.0, 7.0, 10.0], break at 5.0.

Float labels treat the break as a mathematical boundary: the bin that contains the break value gets a square bracket on that side, and the neighbouring bin gets a parenthesis.

| extend | left_closed | lower bin | upper bin |
|---|---|---|---|
| True (default) | True (default) | [-∞, 5) | [5, ∞) |
| True | False | (-∞, 5] | (5, ∞] |
| False | True | [1, 5) | [5, 10] |
| False | False | [1, 5] | (5, 10] |

df_f = pl.DataFrame({"x": [1.0, 3.0, 7.0, 10.0]})
show(df_f, "x", pl.col("x").ps_chop.chop([5.0], extend=True, left_closed=True))
Enum(categories=['[-∞, 5)', '[5, ∞)'])
show(df_f, "x", pl.col("x").ps_chop.chop([5.0], extend=True, left_closed=False))
Enum(categories=['(-∞, 5]', '(5, ∞]'])
show(df_f, "x", pl.col("x").ps_chop.chop([5.0], extend=False, left_closed=True))
Enum(categories=['[1, 5)', '[5, 10]'])
show(df_f, "x", pl.col("x").ps_chop.chop([5.0], extend=False, left_closed=False))
Enum(categories=['[1, 5]', '(5, 10]'])

extend=True (default) makes the outermost edges run to ±∞ regardless of the data range. extend=False closes them at the observed min / max.

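With extend=False, the outermost finite bound is always closed, whatever left_closed says. With left_closed=True the last interval ends in ] because its upper bound is the data maximum; with left_closed=False the first interval starts with [ because its lower bound is the data minimum.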
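
The bracket rules in the float table can be restated as a small pure-Python function. This is only a sketch reconstructed from the four rows shown above, not the library's implementation; the name float_labels and its signature are hypothetical, and it handles a single break only.

```python
import math


def float_labels(values, brk, extend=True, left_closed=True):
    """Return (lower, upper) labels for one break, mimicking the table above.

    Hypothetical helper: it restates the observed label pattern and is not
    taken from ps_chop itself.
    """
    lo = -math.inf if extend else min(values)
    hi = math.inf if extend else max(values)

    def _num(v):
        # Render infinities as in the notebook output; finite values compactly.
        if math.isinf(v):
            return "-∞" if v < 0 else "∞"
        return f"{v:g}"

    if left_closed:
        # The break belongs to the upper bin. A finite outer maximum is
        # closed with "]"; an infinite one keeps the half-open ")".
        lower = f"[{_num(lo)}, {_num(brk)})"
        upper = f"[{_num(brk)}, {_num(hi)}" + (")" if extend else "]")
    else:
        # The break belongs to the lower bin. A finite outer minimum is
        # closed with "["; an infinite one keeps the half-open "(".
        lower = ("(" if extend else "[") + f"{_num(lo)}, {_num(brk)}]"
        upper = f"({_num(brk)}, {_num(hi)}]"
    return lower, upper


data = [1.0, 3.0, 7.0, 10.0]
for extend in (True, False):
    for left_closed in (True, False):
        print(extend, left_closed, float_labels(data, 5.0, extend, left_closed))
```

Running the loop reproduces the four rows of the table, which makes it easy to see that only the outermost bracket changes when extend flips.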

Integer (discrete)

Data: [1, 3, 7, 10] (Int32), break at 5.

Integer labels convert each half-open cut interval to the equivalent closed discrete range, and use {x} for single-element bins.

| extend | left_closed | lower bin | upper bin |
|---|---|---|---|
| True (default) | True (default) | (-∞, 4] | [5, +∞) |
| True | False | (-∞, 5] | [6, +∞) |
| False | True | [1, 4] | [5, 10] |
| False | False | [1, 5] | [6, 10] |

df_i = pl.DataFrame({"x": pl.Series([1, 3, 7, 10], dtype=pl.Int32)})
show(df_i, "x", pl.col("x").ps_chop.chop([5], extend=True, left_closed=True))
Enum(categories=['(-∞, 4]', '[5, +∞)'])
show(df_i, "x", pl.col("x").ps_chop.chop([5], extend=True, left_closed=False))
Enum(categories=['(-∞, 5]', '[6, +∞)'])
show(df_i, "x", pl.col("x").ps_chop.chop([5], extend=False, left_closed=True))
Enum(categories=['[1, 4]', '[5, 10]'])
show(df_i, "x", pl.col("x").ps_chop.chop([5], extend=False, left_closed=False))
Enum(categories=['[1, 5]', '[6, 10]'])

With left_closed=True the break value (5) starts the upper bin, so the lower bin’s last integer is 4. With left_closed=False the break belongs to the lower bin (≤ 5), so the upper bin starts at 6.
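
The conversion from a half-open cut to a closed discrete range comes down to choosing the last integer of the lower bin. The sketch below restates that rule in plain Python for a single break. The function name int_labels is hypothetical, and the {x} rendering for single-element bins follows the notation described above rather than any output shown in this notebook.

```python
def int_labels(values, brk, extend=True, left_closed=True):
    """Return (lower, upper) labels for one integer break.

    Hypothetical helper that mirrors the integer table; it assumes both
    bins are non-empty and is not the library's own code.
    """
    # left_closed=True: the break starts the upper bin, so the lower bin
    # ends one integer earlier. Otherwise the break ends the lower bin.
    lower_last = brk - 1 if left_closed else brk
    upper_first = lower_last + 1

    def closed(lo, hi):
        # Single-element bins use set notation, e.g. {4}.
        if lo == hi:
            return f"{{{lo}}}"
        return f"[{lo}, {hi}]"

    if extend:
        lower = f"(-∞, {lower_last}]"
        upper = f"[{upper_first}, +∞)"
    else:
        lower = closed(min(values), lower_last)
        upper = closed(upper_first, max(values))
    return lower, upper


print(int_labels([1, 3, 7, 10], 5, extend=False, left_closed=True))
print(int_labels([4, 5, 9], 5, extend=False, left_closed=True))
```

The second call shows the single-element case: with data [4, 5, 9] the lower bin holds only the integer 4.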

String / Enum (categorical)

Data: ["cat", "dog", "fish", "gull"], break at "dog".

extend has no effect on categorical columns — the outermost labels always use the first and last observed category. Breaks are matched against the sorted category list (or the Enum’s defined order).

| left_closed | lower bin | upper bin |
|---|---|---|
| True (default) | {cat} | [dog, gull] |
| False | [cat, dog] | [fish, gull] |

df_s = pl.DataFrame({"x": ["cat", "dog", "fish", "gull"]})
show(df_s, "x", pl.col("x").ps_chop.chop(["dog"], left_closed=True))
Enum(categories=['{cat}', '[dog, gull]'])
show(df_s, "x", pl.col("x").ps_chop.chop(["dog"], left_closed=False))
Enum(categories=['[cat, dog]', '[fish, gull]'])

With left_closed=True the break (“dog”) opens the upper bin, so the lower bin contains only “cat” → single-element notation {cat}. With left_closed=False the break belongs to the lower bin, so lower is [cat, dog] and upper starts at the next category “fish”.

Enum columns work identically but the category order comes from the Enum definition rather than alphabetical sort:

df_e = pl.DataFrame({"x": pl.Series(
    ["low", "medium", "high", "medium"],
    dtype=pl.Enum(["low", "medium", "high"])   # non-alphabetical order
)})
show(df_e, "x", pl.col("x").ps_chop.chop(["medium"]))
Enum(categories=['{low}', '[medium, high]'])
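
For categorical data the split is a position in an ordered category list, so the labelling rule can be sketched with list slicing. The helper below is a hypothetical illustration, not the ps_chop implementation. It assumes the break is one of the categories and that both resulting groups are non-empty.

```python
def cat_labels(categories, brk, left_closed=True):
    """Return (lower, upper) labels for one break in an ordered category list.

    `categories` must already be in the intended order: sorted for plain
    strings, or the Enum's declared order for Enum columns.
    """
    i = categories.index(brk)  # raises ValueError if the break is unknown
    # left_closed=True: the break opens the upper group.
    # left_closed=False: the break closes the lower group.
    split = i if left_closed else i + 1
    lower, upper = categories[:split], categories[split:]

    def label(group):
        # One category -> {x}; several -> [first, last].
        if len(group) == 1:
            return "{" + group[0] + "}"
        return f"[{group[0]}, {group[-1]}]"

    return label(lower), label(upper)


strings = sorted(["cat", "dog", "fish", "gull"])
print(cat_labels(strings, "dog", left_closed=True))
print(cat_labels(strings, "dog", left_closed=False))

# Enum-style order is passed as declared, not sorted alphabetically.
print(cat_labels(["low", "medium", "high"], "medium"))
```

Passing the order explicitly is what distinguishes the Enum case: sorting ["low", "medium", "high"] alphabetically would put "high" first and change which categories land in each bin.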

Temporal (Date / Datetime / Duration)

Data: four dates spanning 2020, break at 2020-07-01.

extend has no effect on temporal columns — bounds are always the observed min / max. Labels use str() of each value by default.

| left_closed | lower bin | upper bin |
|---|---|---|
| True (default) | [2020-01-01, 2020-07-01) | [2020-07-01, 2020-12-01] |
| False | [2020-01-01, 2020-07-01] | (2020-07-01, 2020-12-01] |

D = datetime.date
df_d = pl.DataFrame({"d": [D(2020,1,1), D(2020,4,1), D(2020,9,1), D(2020,12,1)]})
show(df_d, "d", pl.col("d").ps_chop.chop([D(2020,7,1)], left_closed=True))
Enum(categories=['[2020-01-01, 2020-07-01)', '[2020-07-01, 2020-12-01]'])
show(df_d, "d", pl.col("d").ps_chop.chop([D(2020,7,1)], left_closed=False))
Enum(categories=['[2020-01-01, 2020-07-01]', '(2020-07-01, 2020-12-01]'])

fmt — custom label formatting

Float / Integer: format-spec string or callable

df_f = pl.DataFrame({"x": [1.0, 3.0, 7.0, 10.0]})
# Format-spec string
show(df_f, "x", pl.col("x").ps_chop.chop([5.0], fmt=".2f"))
Enum(categories=['[-∞, 5.00)', '[5.00, ∞)'])
# Callable
show(df_f, "x", pl.col("x").ps_chop.chop([5.0], fmt=lambda v: f"${v:.0f}"))
Enum(categories=['[-∞, $5)', '[$5, ∞)'])

String / Enum: callable applied to each category name

df_s = pl.DataFrame({"x": ["cat", "dog", "fish", "gull"]})
show(df_s, "x", pl.col("x").ps_chop.chop(["dog"], fmt=str.upper))
Enum(categories=['{CAT}', '[DOG, GULL]'])

Temporal: callable applied to each bound value

D = datetime.date
df_d = pl.DataFrame({"d": [D(2020,1,1), D(2020,4,1), D(2020,9,1), D(2020,12,1)]})
show(df_d, "d", pl.col("d").ps_chop.chop([D(2020,7,1)], fmt=lambda d: d.strftime("%b %Y")))
Enum(categories=['[Jan 2020, Jul 2020)', '[Jul 2020, Dec 2020]'])
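
One way to picture how a single fmt argument can accept both a format-spec string and a callable is a small dispatcher applied to each bound before the brackets are added. The sketch below is an assumption about the general technique, not the library's code, and format_bound is a hypothetical name. The None branch uses str(), which the text above states only for temporal values; the default numeric formatting is not modelled here.

```python
import datetime


def format_bound(value, fmt=None):
    """Format one interval bound according to `fmt` (hypothetical helper)."""
    if fmt is None:
        # Default stated for temporal columns: plain str() of the value.
        return str(value)
    if isinstance(fmt, str):
        # Format-spec string such as ".2f", applied via the format protocol.
        return format(value, fmt)
    # Anything else is treated as a callable taking the value.
    return fmt(value)


print(format_bound(5.0, ".2f"))
print(format_bound(5.0, lambda v: f"${v:.0f}"))
print(format_bound("cat", str.upper))
print(format_bound(datetime.date(2020, 7, 1), lambda d: d.strftime("%b %Y")))
```

Because only the bound text is transformed, the brackets and the {x} notation stay under the control of left_closed and extend, which matches the outputs shown above.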

Parameter summary

| Parameter | Float | Integer | String / Enum | Temporal |
|---|---|---|---|---|
| left_closed | controls which side of break is open | same, but label adjusts to show discrete range | controls whether break belongs to lower or upper bin | controls which side is open |
| extend | True → ±∞ outer labels; False → data min/max | same | no effect (always data min/max) | no effect (always data min/max) |
| fmt | format-spec string or Callable[[float], str] | same | Callable[[str], str] | Callable[[value], str] |

Differences from santoku

ps_chop is inspired by R’s santoku package but diverges in several ways.

extend semantics — santoku’s default is extend = NULL, which uses the data range for outer bin boundaries — equivalent to our extend=False. Santoku’s extend = FALSE turns values outside any break into null; we have no equivalent mode.

Last interval — santoku’s close_end parameter controls whether the final interval is closed on both sides. We always close the last interval when its bound is finite, matching santoku’s default close_end = TRUE.

Integer columns — santoku uses half-open [a, b) notation for all numeric types. We use fully-closed [a, b] notation for integer columns, converting each half-open boundary to the integer it contains (e.g. [lo, 5) becomes [lo, 4]).

String / Enum columns — santoku treats character vectors as unordered and errors on chop; quantile-based functions warn. We support string and Enum columns natively, with labels like [apple, cherry] or {singleton}.

drop parameter — santoku can drop empty factor levels from the result. We always preserve all Enum categories; empty bins appear as unused levels.

Warnings — santoku warns when quantile boundaries collapse on identical data, or when ties force groups larger than n in chop_n. We silently handle both cases.