rustyconover opened a new issue, #423: URL: https://github.com/apache/arrow-js/issues/423
## Summary

Arrow JS has no mechanism to register custom getters for Arrow extension types. Columns with `ARROW:extension:name` and `ARROW:extension:metadata` field metadata always return raw bytes from `get()`. Every consumer must independently check metadata and decode values.

## Background

The Arrow Extension Type spec ([format docs](https://arrow.apache.org/docs/format/Columnar.html#extension-types)) allows producers to annotate fields with semantic type information via metadata:

- `ARROW:extension:name`: type identifier (e.g., `"arrow.uuid"`, `"arrow.opaque"`)
- `ARROW:extension:metadata`: serialized type parameters (e.g., `{"type_name": "hugeint", "vendor_name": "DuckDB"}`)

Other Arrow implementations provide extension type registration:

- **Arrow C++**: `arrow::ExtensionType`; register a subclass with `RegisterExtensionType()`, and IPC deserialization automatically produces typed arrays with custom accessors
- **Arrow Python**: `pyarrow.ExtensionType`; register with `register_extension_type()`, and a custom `__arrow_ext_deserialize__` decodes IPC data into Python objects
- **Arrow Rust**: `arrow::datatypes::ExtensionType` trait

Arrow JS has no equivalent. Extension types are preserved in field metadata but `get()` returns the raw storage value (e.g., `Uint8Array` for `FixedSizeBinary`).
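For illustration, here is a minimal sketch of the metadata check a consumer has to do by hand today. It assumes `field.metadata` is a `Map` of strings (a plain `Map` stands in for it below), and that the extension metadata is JSON, which holds for the `arrow.opaque` example above but is not guaranteed for every extension type. `readExtensionInfo` is a hypothetical helper name, not part of Arrow JS.

```js
const EXT_NAME_KEY = 'ARROW:extension:name';
const EXT_METADATA_KEY = 'ARROW:extension:metadata';

// Hypothetical helper (not an Arrow JS API): pull the extension name and
// parameters out of a field's metadata map.
function readExtensionInfo(metadata) {
  const name = metadata.get(EXT_NAME_KEY);
  if (name === undefined) return null; // not an extension-typed field

  const rawMetadata = metadata.get(EXT_METADATA_KEY) ?? null;
  let params = null;
  if (rawMetadata !== null) {
    try {
      // Assumption: metadata is JSON; leave params as null when it is not.
      params = JSON.parse(rawMetadata);
    } catch {
      params = null;
    }
  }
  return { name, rawMetadata, params };
}

// A plain Map standing in for `field.metadata` on a DuckDB HUGEINT column.
const hugeintMetadata = new Map([
  [EXT_NAME_KEY, 'arrow.opaque'],
  [EXT_METADATA_KEY, '{"type_name": "hugeint", "vendor_name": "DuckDB"}'],
]);

const info = readExtensionInfo(hugeintMetadata);
console.log(info.name, info.params.type_name);
```

Even with a helper like this, the caller still has to branch on `name` and `params` and decode the storage bytes itself.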
## Impact

DuckDB with `arrow_lossless_conversion=true` serializes several types as Arrow extension types:

| DuckDB Type | Arrow Storage | Extension Name | Bytes |
|---|---|---|---|
| `HUGEINT` | FixedSizeBinary[16] | arrow.opaque | 16-byte two's complement signed int |
| `UHUGEINT` | FixedSizeBinary[16] | arrow.opaque | 16-byte unsigned int |
| `TIME WITH TIME ZONE` | FixedSizeBinary[8] | arrow.opaque | packed micros + offset |
| `UUID` | FixedSizeBinary[16] | arrow.uuid | 16 raw bytes |
| `BIGNUM` | Binary | arrow.opaque | 3-byte header + big-endian magnitude |
| `VARINT` | Binary | arrow.opaque | same as BIGNUM |
| `BIT` | Binary | arrow.opaque | padding byte + bit data |

For each of these, consumers must:

1. Check `field.metadata.get("ARROW:extension:metadata")`
2. Parse the JSON to get `type_name`
3. Read raw bytes from `column.data[0].values` at the correct offset
4. Interpret the binary encoding (two's complement, packed bitfields, etc.)

This is ~100 lines of manual decoding in our codebase, repeated by every consumer that reads DuckDB Arrow output.

## Proposal

Add an extension type registry, similar to C++/Python:

```js
import { registerExtensionType } from 'apache-arrow';

registerExtensionType({
  name: 'arrow.opaque', // matches ARROW:extension:name
  match: (metadata) => { // optional: filter by extension metadata
    const parsed = JSON.parse(metadata);
    return parsed.type_name === 'hugeint';
  },
  get: (data, index) => { // custom getter, replaces default
    const dv = new DataView(data.values.buffer, data.values.byteOffset + index * 16, 16);
    const lo = dv.getBigUint64(0, true);
    const hi = dv.getBigUint64(8, true);
    const raw = lo | (hi << 64n);
    if (raw & (1n << 127n)) {
      const mask = (1n << 128n) - 1n;
      return -(((raw ^ mask) + 1n) & mask);
    }
    return raw;
  },
});
```

After registration, `vector.get(i)` on a HUGEINT column would return a BigInt directly instead of a Uint8Array.
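The decoding inside the getter above can be exercised without Arrow at all. The following is a minimal sketch, assuming 16-byte little-endian storage as in the example; `decodeHugeint` and `encodeHugeint` are hypothetical helper names, not part of Arrow JS. It uses the standard `BigInt.asIntN` in place of the manual mask-and-negate, which should be equivalent for 128-bit values.

```js
// Hypothetical helper: decode the 16-byte little-endian two's complement
// integer stored at element `index` of a FixedSizeBinary[16] values buffer.
function decodeHugeint(values, index) {
  const dv = new DataView(values.buffer, values.byteOffset + index * 16, 16);
  const lo = dv.getBigUint64(0, true); // low 64 bits
  const hi = dv.getBigUint64(8, true); // high 64 bits
  // Reinterpret the unsigned 128-bit pattern as a signed 128-bit integer.
  return BigInt.asIntN(128, lo | (hi << 64n));
}

// Hypothetical inverse: encode a BigInt into 16 little-endian bytes.
function encodeHugeint(value) {
  if (BigInt.asIntN(128, value) !== value) {
    throw new RangeError('value does not fit in a signed 128-bit integer');
  }
  const raw = BigInt.asUintN(128, value); // two's complement bit pattern
  const out = new Uint8Array(16);
  const dv = new DataView(out.buffer);
  dv.setBigUint64(0, raw & ((1n << 64n) - 1n), true);
  dv.setBigUint64(8, raw >> 64n, true);
  return out;
}

const bytes = encodeHugeint(-1n);
console.log(decodeHugeint(bytes, 0)); // -1n
```

Keeping the two directions as a pair makes it easy to check that a value survives an encode/decode round trip unchanged.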
The registry could also support a `serialize` method for the write path, making round-trip extension types fully supported.

## Alternatives

- **Do nothing**: consumers continue to manually decode. Works, but fragile and duplicated.
- **Vendor-specific packages**: e.g., `@duckdb/arrow-extensions` that monkey-patches Arrow's visitor. Feasible but hacky.
- **Local fork of get.mjs**: what we currently do via Vite alias. Maintenance burden.

## Context

We maintain a DuckDB WASM frontend that displays query results through Arrow IPC. Every DuckDB extension type requires custom byte-level decoding because Arrow JS can't be taught about them. The same decoding logic would need to be written by anyone consuming DuckDB, Spark, or other engines that use Arrow extension types in JS.

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
