rustyconover opened a new issue, #421:
URL: https://github.com/apache/arrow-js/issues/421

   ## Summary
   
   `Vector.get(i)` on Timestamp columns throws `TypeError` for valid timestamp 
values whose microsecond representation exceeds `Number.MAX_SAFE_INTEGER`. This 
makes large portions of the valid timestamp range inaccessible through Arrow 
JS's public API.
   
   ## Root cause
   
   The timestamp get visitors in `visitor/get.mjs` convert `BigInt64Array` 
values to `Number` via `bigIntToNumber()` / `divideBigInts()`:
   
   ```js
   const getTimestampSecond = ({ values }, index) => 1000 * 
bigIntToNumber(values[index]);
   const getTimestampMillisecond = ({ values }, index) => 
bigIntToNumber(values[index]);
   const getTimestampMicrosecond = ({ values }, index) => 
divideBigInts(values[index], BigInt(1000));
   const getTimestampNanosecond = ({ values }, index) => 
divideBigInts(values[index], BigInt(1000000));
   ```
   
   `bigIntToNumber()` throws if the value is outside `[-MAX_SAFE_INTEGER, 
MAX_SAFE_INTEGER]`:
   
   ```js
   export function bigIntToNumber(number) {
       if (typeof number === 'bigint' && (number < Number.MIN_SAFE_INTEGER || 
number > Number.MAX_SAFE_INTEGER)) {
           throw new TypeError(`${number} is not safe to convert to a number.`);
       }
       return Number(number);
   }
   ```
   
   ## Impact
   
   For `Timestamp<MICROSECOND>` (DuckDB's default, and common in Parquet files):
   - **Accessible range**: 1970-04-26 to 2255-06-05 (~285 years)
   - **Valid DuckDB range**: 290,309 BC to 294,247 AD (~584,000 years)  
   - **Infinity sentinels** (`INT64_MAX`, `INT64_MIN+1`): completely 
inaccessible
   
   This means `get()` throws — not returns null, not returns a sentinel, but 
*throws an exception* — for valid data that was correctly serialized and 
deserialized through IPC.
   
   For `Timestamp<NANOSECOND>`, the accessible range is even narrower.
   
   The conversion is also lossy within range. `divideBigInts(value, 1000n)` 
divides BigInt microseconds into a float, losing sub-millisecond precision. A 
microsecond timestamp of `1652397825000001n` returns `1652397825000.001` — 
correct here, but precision degrades as values grow.
   
   ## Comparison with other Arrow implementations
   
   - **Arrow C++**: Returns `int64_t`. Caller interprets.
   - **Arrow Python**: Returns raw int via `.as_py()` on scalar, or numpy int64 
in arrays. `to_pylist()` preserves int64.
   - **Arrow Rust**: Returns `i64`.
   - **Arrow JS**: Converts to `Number`, throws if out of range.
   
   Arrow JS is the only implementation where `get()` can throw on valid data.
   
   Note that `visitInt64` and `visitUint64` already return raw BigInts:
   ```js
   const getBigInts = ({ values }, index) => values[index];
   ```
   
   The timestamp visitors are inconsistent with this — they convert instead of 
returning the raw value.
   
   ## Proposal
   
   **Return raw BigInt from timestamp visitors**, matching Int64/Uint64 
behavior:
   
   ```js
   const getTimestampSecond = ({ values }, index) => values[index];
   const getTimestampMillisecond = ({ values }, index) => values[index];
   const getTimestampMicrosecond = ({ values }, index) => values[index];
   const getTimestampNanosecond = ({ values }, index) => values[index];
   ```
   
   The raw BigInt preserves full precision and never throws. Callers that want 
a JS `Date` or millisecond Number can convert explicitly.
   
   For convenience, a `toDate(index)` method or utility function could provide 
the current behavior for callers in the common case (recent dates, millisecond 
precision is fine):
   
   ```js
   function timestampToDate(vector, index) {
     const raw = vector.get(index);
     if (raw === null) return null;
     const ms = Number(raw / 1000n); // microseconds to ms, lossy but fine for 
Date
     return new Date(ms);
   }
   ```
   
   Similarly, `getDateDay` should return raw int32 days rather than converting 
to epoch milliseconds, since the millisecond conversion loses precision for 
dates far from epoch.
   
   ### Breaking change
   
   This is a breaking change for code that assumes `get()` on a timestamp 
column returns a Number. The alternatives are:
   
   1. **Breaking change in a major version** — cleanest, aligns with 
C++/Python/Rust
   2. **Opt-in flag** (e.g., `{ useBigInt: true }` on reader options) — 
backwards-compatible but adds API surface
   3. **Return BigInt only when the value is outside safe range** — surprising 
behavior difference based on data values
   
   I'd advocate for option 1. Returning BigInt is consistent with how 
Int64/Uint64 already behave, and the current behavior (throwing) is arguably 
already broken — callers can't rely on `get()` succeeding, which defeats the 
purpose of a typed API.
   
   ## Context
   
   We hit this building a DuckDB WASM shell that displays `test_all_types()` 
output. Lists containing timestamps with infinity sentinels, extreme-range 
timestamps, and nanosecond-precision timestamps all trigger this. Our 
workaround reads raw `BigInt64Array` values from `column.data[0].values`, 
bypassing Arrow's getter entirely — which works but defeats the purpose of the 
Vector abstraction.
   
   Related issues: #91 (timestamp types all return same values), #83 (pre-epoch 
timestamp offset errors), #77 (decimal conversion errors). These are all 
symptoms of the same underlying design: eagerly converting int64 → Number in 
get visitors.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to