rustyconover opened a new issue, #421:
URL: https://github.com/apache/arrow-js/issues/421
## Summary
`Vector.get(i)` on Timestamp columns throws `TypeError` for valid timestamp
values whose microsecond representation exceeds `Number.MAX_SAFE_INTEGER`. This
makes large portions of the valid timestamp range inaccessible through Arrow
JS's public API.
## Root cause
The timestamp get visitors in `visitor/get.mjs` convert `BigInt64Array`
values to `Number` via `bigIntToNumber()` / `divideBigInts()`:
```js
const getTimestampSecond = ({ values }, index) => 1000 *
bigIntToNumber(values[index]);
const getTimestampMillisecond = ({ values }, index) =>
bigIntToNumber(values[index]);
const getTimestampMicrosecond = ({ values }, index) =>
divideBigInts(values[index], BigInt(1000));
const getTimestampNanosecond = ({ values }, index) =>
divideBigInts(values[index], BigInt(1000000));
```
`bigIntToNumber()` throws if the value is outside `[-MAX_SAFE_INTEGER,
MAX_SAFE_INTEGER]`:
```js
export function bigIntToNumber(number) {
if (typeof number === 'bigint' && (number < Number.MIN_SAFE_INTEGER ||
number > Number.MAX_SAFE_INTEGER)) {
throw new TypeError(`${number} is not safe to convert to a number.`);
}
return Number(number);
}
```
## Impact
For `Timestamp<MICROSECOND>` (DuckDB's default, and common in Parquet files):
- **Accessible range**: 1970-04-26 to 2255-06-05 (~285 years)
- **Valid DuckDB range**: 290,309 BC to 294,247 AD (~584,000 years)
- **Infinity sentinels** (`INT64_MAX`, `INT64_MIN+1`): completely
inaccessible
This means `get()` throws — not returns null, not returns a sentinel, but
*throws an exception* — for valid data that was correctly serialized and
deserialized through IPC.
For `Timestamp<NANOSECOND>`, the accessible range is even narrower.
The conversion is also lossy within range. `divideBigInts(value, 1000n)`
divides BigInt microseconds into a float, losing sub-millisecond precision. A
microsecond timestamp of `1652397825000001n` returns `1652397825000.001` —
correct here, but precision degrades as values grow.
## Comparison with other Arrow implementations
- **Arrow C++**: Returns `int64_t`. Caller interprets.
- **Arrow Python**: Returns raw int via `.as_py()` on scalar, or numpy int64
in arrays. `to_pylist()` preserves int64.
- **Arrow Rust**: Returns `i64`.
- **Arrow JS**: Converts to `Number`, throws if out of range.
Arrow JS is the only implementation where `get()` can throw on valid data.
Note that `visitInt64` and `visitUint64` already return raw BigInts:
```js
const getBigInts = ({ values }, index) => values[index];
```
The timestamp visitors are inconsistent with this — they convert instead of
returning the raw value.
## Proposal
**Return raw BigInt from timestamp visitors**, matching Int64/Uint64
behavior:
```js
const getTimestampSecond = ({ values }, index) => values[index];
const getTimestampMillisecond = ({ values }, index) => values[index];
const getTimestampMicrosecond = ({ values }, index) => values[index];
const getTimestampNanosecond = ({ values }, index) => values[index];
```
The raw BigInt preserves full precision and never throws. Callers that want
a JS `Date` or millisecond Number can convert explicitly.
For convenience, a `toDate(index)` method or utility function could provide
the current behavior for callers in the common case (recent dates, millisecond
precision is fine):
```js
function timestampToDate(vector, index) {
const raw = vector.get(index);
if (raw === null) return null;
const ms = Number(raw / 1000n); // microseconds to ms, lossy but fine for
Date
return new Date(ms);
}
```
Similarly, `getDateDay` should return raw int32 days rather than converting
to epoch milliseconds, since the millisecond conversion loses precision for
dates far from epoch.
### Breaking change
This is a breaking change for code that assumes `get()` on a timestamp
column returns a Number. The alternatives are:
1. **Breaking change in a major version** — cleanest, aligns with
C++/Python/Rust
2. **Opt-in flag** (e.g., `{ useBigInt: true }` on reader options) —
backwards-compatible but adds API surface
3. **Return BigInt only when the value is outside safe range** — surprising
behavior difference based on data values
I'd advocate for option 1. Returning BigInt is consistent with how
Int64/Uint64 already behave, and the current behavior (throwing) is arguably
already broken — callers can't rely on `get()` succeeding, which defeats the
purpose of a typed API.
## Context
We hit this building a DuckDB WASM shell that displays `test_all_types()`
output. Lists containing timestamps with infinity sentinels, extreme-range
timestamps, and nanosecond-precision timestamps all trigger this. Our
workaround reads raw `BigInt64Array` values from `column.data[0].values`,
bypassing Arrow's getter entirely — which works but defeats the purpose of the
Vector abstraction.
Related issues: #91 (timestamp types all return same values), #83 (pre-epoch
timestamp offset errors), #77 (decimal conversion errors). These are all
symptoms of the same underlying design: eagerly converting int64 → Number in
get visitors.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]