Hoeze opened a new issue, #50027:
URL: https://github.com/apache/arrow/issues/50027

   ### Describe the enhancement requested
   
   Arrow has no canonical way to represent a bounded range (a mathematical 
interval with a lower and an upper endpoint), e.g. a numeric range `[0, 10)`, a 
date range, or a timestamp period. Today such data is modeled ad hoc with two 
separate columns or with system-specific extension types, which hurts 
interoperability. A canonical range type would be useful to libraries such as 
pandas, Polars/Polars-bio, IRanges/PyRanges, and database connectors, among others.
   
   Note this is distinct from Arrow's existing calendar `Interval` type 
(`INTERVAL_MONTHS` / `INTERVAL_DAY_TIME` / `INTERVAL_MONTH_DAY_NANO`), which 
represents a duration (a signed amount of time), not a bounded set. Databases 
like PostgreSQL make the same distinction: SQL uses `INTERVAL` for durations 
and `RANGE` / `PERIOD` for bounded sets. This proposal follows that convention 
by naming the type `arrow.range`.
   
   
   ### Proposed design
   - Extension name: `arrow.range`.
   - Storage type: `Struct<lower: T, upper: T>`. When subtype `T` is nullable, 
a null bound represents an unbounded (infinite) endpoint.
     - Field names `lower` / `upper` (PostgreSQL convention) are chosen 
deliberately for ordering clarity. Note that pandas uses `left` / `right` for 
the corresponding fields.
     - The subtype `T` may be any orderable Arrow type (the numeric, temporal 
and decimal families, etc.). Nested or non-comparable types are out of scope.
   - Metadata: a JSON object `{"closed": "..."}`. 
     - Parameter `closed`: one of `left`, `right`, `both`, `neither` (pandas 
vocabulary): `left` = `[lower, upper)`, `right` = `(lower, upper]`, `both` = 
`[lower, upper]`, `neither` = `(lower, upper)`.
     - `closed` is required on the wire so a serialized `arrow.range` is always 
unambiguous. Unknown JSON keys are ignored for forward compatibility.
   
   - A range is empty implicitly when `lower > upper`, or when `lower == upper` 
with at least one bound exclusive. A range with `lower > upper` is therefore 
valid (it denotes the empty set), not an error.
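
The metadata and emptiness rules above can be sketched in plain Python. This is only an illustration of the proposed semantics, not an Arrow API: the function names `parse_range_metadata` and `is_empty` are hypothetical, `None` stands in for a null (unbounded) bound, and treating a range with an unbounded endpoint as non-empty is an assumption the proposal does not state explicitly.

```python
import json

VALID_CLOSED = {"left", "right", "both", "neither"}


def parse_range_metadata(serialized: bytes) -> str:
    """Return the `closed` value from serialized `arrow.range` metadata."""
    obj = json.loads(serialized.decode("utf-8"))
    # `closed` is required so that a serialized range is never ambiguous.
    if not isinstance(obj, dict) or "closed" not in obj:
        raise ValueError("metadata must be a JSON object with a 'closed' key")
    closed = obj["closed"]
    if not isinstance(closed, str) or closed not in VALID_CLOSED:
        raise ValueError(f"invalid 'closed' value: {closed!r}")
    # All other keys are ignored for forward compatibility.
    return closed


def is_empty(lower, upper, closed: str) -> bool:
    """Apply the implicit emptiness rule to a single range."""
    if lower is None or upper is None:
        # Assumption: a range with an unbounded endpoint is never empty.
        return False
    if lower > upper:
        return True
    # Equal bounds are empty unless both endpoints are inclusive.
    return lower == upper and closed != "both"
```

For example, `is_empty(3, 3, "left")` is true because `[3, 3)` contains no value, while `is_empty(3, 3, "both")` is false because `[3, 3]` contains exactly `3`.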
   
   
   ### Relation to pandas
   
   This mirrors pandas' interval support and deliberately reuses its vocabulary:
   
   - `pandas.Interval` is the scalar form: an immutable bounded interval whose 
`closed` parameter takes exactly `left`, `right`, `both`, or `neither`, which is 
the vocabulary adopted here for the `closed` metadata.
   - `pandas.IntervalIndex` / `pandas.arrays.IntervalArray` (dtype 
`interval[T]`) is the columnar form: it stores parallel `left` and `right` 
bound arrays with a single `closed` applying to every element, directly 
analogous to the proposed `Struct<lower, upper>` storage with an object-level 
`closed`.
   - Per-row closedness is explicitly not a goal of `arrow.range`. If 
needed, it could be achieved with a Union type.
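
To make the columnar layout concrete, here is a minimal pure-Python stand-in for a range column: two parallel bound lists plus one `closed` value shared by every row. The `RangeColumn` class and its `contains` method are hypothetical and use neither pandas nor Arrow; `None` models an unbounded endpoint, as in the proposed storage.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RangeColumn:
    """Hypothetical stand-in for an `arrow.range` column of integers."""

    lower: List[Optional[int]]  # None = unbounded below
    upper: List[Optional[int]]  # None = unbounded above
    closed: str                 # one value for the whole column

    def contains(self, i: int, value: int) -> bool:
        """Check whether row `i` contains `value` under the shared `closed`."""
        lo, hi = self.lower[i], self.upper[i]
        lower_inclusive = self.closed in ("left", "both")
        upper_inclusive = self.closed in ("right", "both")
        lower_ok = lo is None or (value >= lo if lower_inclusive else value > lo)
        upper_ok = hi is None or (value <= hi if upper_inclusive else value < hi)
        return lower_ok and upper_ok


# Rows: [0, 10), [10, 20), and (-inf, 0), all sharing closed="left".
col = RangeColumn(lower=[0, 10, None], upper=[10, 20, 0], closed="left")
```

Because `closed` lives on the column rather than on each row, the value `10` falls outside row 0 but inside row 1, with no per-row flag needed to decide that.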
   
   
   
   ### Component(s)
   
   Format

