rustyconover opened a new issue, #423:
URL: https://github.com/apache/arrow-js/issues/423

   ## Summary
   
   Arrow JS has no mechanism for registering custom getters for Arrow extension 
types. Columns whose fields carry `ARROW:extension:name` and 
`ARROW:extension:metadata` metadata always return the raw storage value from 
`get()`, so every consumer must independently check the metadata and decode 
values.
   
   ## Background
   
   The Arrow Extension Type spec ([format 
docs](https://arrow.apache.org/docs/format/Columnar.html#extension-types)) 
allows producers to annotate fields with semantic type information via metadata:
   
   - `ARROW:extension:name` — type identifier (e.g., `"arrow.uuid"`, 
`"arrow.opaque"`)
   - `ARROW:extension:metadata` — serialized type parameters (e.g., 
`{"type_name": "hugeint", "vendor_name": "DuckDB"}`)
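   
   As a rough illustration, the two keys can be read from a field's metadata 
as sketched below. This assumes the metadata is exposed as a 
`Map<string, string>`, which is how Arrow JS surfaces `field.metadata`; 
`readExtensionInfo` is a hypothetical helper name, not an Arrow JS API.
   
   ```js
   // Hypothetical helper: extract extension info from a field's metadata Map.
   function readExtensionInfo(metadata) {
     const name = metadata.get('ARROW:extension:name');
     if (name === undefined) return null; // not an extension type
     const raw = metadata.get('ARROW:extension:metadata');
     let params = null;
     if (raw) {
       // Parameters are defined per type; arrow.opaque uses JSON, others may not.
       try { params = JSON.parse(raw); } catch { params = raw; }
     }
     return { name, params };
   }
   
   // Example metadata as it might appear on a DuckDB HUGEINT field.
   const meta = new Map([
     ['ARROW:extension:name', 'arrow.opaque'],
     ['ARROW:extension:metadata', '{"type_name": "hugeint", "vendor_name": "DuckDB"}'],
   ]);
   const info = readExtensionInfo(meta);
   ```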
   
   Other Arrow implementations provide extension type registration:
   
   - **Arrow C++**: `arrow::ExtensionType` — register a subclass with 
`RegisterExtensionType()`, and IPC deserialization automatically produces typed 
arrays with custom accessors
   - **Arrow Python**: `pyarrow.ExtensionType` — register with 
`register_extension_type()`; a custom `__arrow_ext_deserialize__` rebuilds the 
type from its serialized metadata when IPC data is read
   - **Arrow Rust**: `arrow::datatypes::ExtensionType` trait
   
   Arrow JS has no equivalent. Extension types are preserved in field metadata 
but `get()` returns the raw storage value (e.g., `Uint8Array` for 
`FixedSizeBinary`).
   
   ## Impact
   
   DuckDB with `arrow_lossless_conversion=true` serializes several types as 
Arrow extension types:
   
   | DuckDB Type | Arrow Storage | Extension Name | Encoding |
   |---|---|---|---|
   | `HUGEINT` | FixedSizeBinary[16] | arrow.opaque | 16-byte two's complement signed int |
   | `UHUGEINT` | FixedSizeBinary[16] | arrow.opaque | 16-byte unsigned int |
   | `TIME WITH TIME ZONE` | FixedSizeBinary[8] | arrow.opaque | packed micros + offset |
   | `UUID` | FixedSizeBinary[16] | arrow.uuid | 16 raw bytes |
   | `BIGNUM` | Binary | arrow.opaque | 3-byte header + big-endian magnitude |
   | `VARINT` | Binary | arrow.opaque | same as BIGNUM |
   | `BIT` | Binary | arrow.opaque | padding byte + bit data |
   
   For each of these, consumers must:
   1. Check `field.metadata.get("ARROW:extension:name")` and read `field.metadata.get("ARROW:extension:metadata")`
   2. Parse the JSON to get `type_name`
   3. Read raw bytes from `column.data[0].values` at the correct offset
   4. Interpret the binary encoding (two's complement, packed bitfields, etc.)
   
   This is ~100 lines of manual decoding in our codebase, repeated by every 
consumer that reads DuckDB Arrow output.
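   
   The steps above can be sketched for the two `FixedSizeBinary[16]` cases as 
follows. This is a minimal illustration, not our actual code: 
`decodeFixed16` is a hypothetical name, the little-endian layout for `HUGEINT` 
is taken from the getter in the proposal, and null handling is omitted.
   
   ```js
   // Hypothetical consumer-side decoder for one cell of a FixedSizeBinary[16] column.
   // `metadata` is the field's metadata Map, `values` the column's raw Uint8Array,
   // `index` the row number.
   function decodeFixed16(metadata, values, index) {
     const bytes = values.subarray(index * 16, index * 16 + 16);
     const name = metadata.get('ARROW:extension:name');
   
     if (name === 'arrow.uuid') {
       // 16 raw bytes rendered as the usual 8-4-4-4-12 hex string.
       const hex = Array.from(bytes, (b) => b.toString(16).padStart(2, '0')).join('');
       return `${hex.slice(0, 8)}-${hex.slice(8, 12)}-${hex.slice(12, 16)}-${hex.slice(16, 20)}-${hex.slice(20)}`;
     }
   
     if (name === 'arrow.opaque') {
       const { type_name } = JSON.parse(metadata.get('ARROW:extension:metadata'));
       if (type_name === 'hugeint') {
         const dv = new DataView(bytes.buffer, bytes.byteOffset, 16);
         const lo = dv.getBigUint64(0, true);
         const hi = dv.getBigInt64(8, true); // signed high word carries the sign
         return (hi << 64n) | lo;
       }
     }
     return bytes; // unknown extension: fall back to the raw storage bytes
   }
   ```
   
   Every additional type in the table needs another branch like these, which 
is where the line count comes from.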
   
   ## Proposal
   
   Add an extension type registry, similar to C++/Python:
   
   ```js
   import { registerExtensionType } from 'apache-arrow';
   
   registerExtensionType({
     name: 'arrow.opaque',         // matches ARROW:extension:name
     match: (metadata) => {        // optional: filter by extension metadata
       const parsed = JSON.parse(metadata);
       return parsed.type_name === 'hugeint';
     },
     get: (data, index) => {       // custom getter, replaces default
       const dv = new DataView(data.values.buffer, data.values.byteOffset + index * 16, 16);
       const lo = dv.getBigUint64(0, true);
       const hi = dv.getBigUint64(8, true);
       const raw = lo | (hi << 64n);
       if (raw & (1n << 127n)) {
         const mask = (1n << 128n) - 1n;
         return -(((raw ^ mask) + 1n) & mask);
       }
       return raw;
     },
   });
   ```
   
   After registration, `vector.get(i)` on a HUGEINT column would return a 
BigInt directly instead of a Uint8Array.
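   
   One way the lookup side might work is sketched below. Everything here is 
an assumption for discussion rather than existing Arrow JS code: the `registry` 
map, the standalone `registerExtensionType`, and the `findExtensionHandler` 
helper are hypothetical, and the handler would presumably be resolved once per 
column rather than once per value.
   
   ```js
   // Hypothetical registry: extension name -> list of registered handlers.
   const registry = new Map();
   
   function registerExtensionType(handler) {
     const list = registry.get(handler.name) ?? [];
     list.push(handler);
     registry.set(handler.name, list);
   }
   
   // Return the first handler whose name and optional `match` accept this field,
   // or undefined so the caller can fall back to the storage-type getter.
   function findExtensionHandler(metadata) {
     const name = metadata.get('ARROW:extension:name');
     const extMeta = metadata.get('ARROW:extension:metadata') ?? '';
     return (registry.get(name) ?? []).find((h) => !h.match || h.match(extMeta));
   }
   ```
   
   Keeping a list per name lets several `arrow.opaque` handlers coexist, with 
`match` choosing between them.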
   
   This could also support a `serialize` method for the write path, making 
round-trip extension types fully supported.
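   
   For HUGEINT, such a `serialize` might look like the sketch below. The 
function names are hypothetical, and the layout (16 bytes, little-endian, two's 
complement) is assumed to mirror the getter shown above.
   
   ```js
   // Hypothetical `serialize` counterpart: BigInt -> 16 little-endian bytes.
   function serializeHugeint(value) {
     const out = new Uint8Array(16);
     const dv = new DataView(out.buffer);
     const raw = BigInt.asUintN(128, value); // two's complement bit pattern
     dv.setBigUint64(0, raw & 0xffffffffffffffffn, true);
     dv.setBigUint64(8, raw >> 64n, true);
     return out;
   }
   
   // Inverse, equivalent to the getter in the proposal.
   function deserializeHugeint(bytes) {
     const dv = new DataView(bytes.buffer, bytes.byteOffset, 16);
     const raw = dv.getBigUint64(0, true) | (dv.getBigUint64(8, true) << 64n);
     return BigInt.asIntN(128, raw);
   }
   ```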
   
   ## Alternatives
   
   - **Do nothing**: consumers continue to manually decode. Works, but fragile 
and duplicated.
   - **Vendor-specific packages**: e.g., `@duckdb/arrow-extensions` that 
monkey-patches Arrow's visitor. Feasible but hacky.
   - **Local fork of get.mjs**: what we currently do via Vite alias. 
Maintenance burden.
   
   ## Context
   
   We maintain a DuckDB WASM frontend that displays query results through Arrow 
IPC. Every DuckDB extension type requires custom byte-level decoding because 
Arrow JS can't be taught about them. The same decoding logic would need to be 
written by anyone consuming DuckDB, Spark, or other engines that use Arrow 
extension types in JS.

