paleolimbot opened a new issue, #49704: URL: https://github.com/apache/arrow/issues/49704
### Describe the bug, including details regarding any error messages, version, and platform.

When implementing nanoarrow's IPC reader (https://github.com/apache/arrow-nanoarrow/pull/861) I was surprised at the interaction of dictionary encodings and extension types. My personal summary (which could use checking!) is:

- ExtensionType with dictionary storage roundtrips over the C Data interface AND IPC
- DictionaryType with extension value type roundtrips over the C Data interface but NOT IPC

Specifically for IPC, the DictionaryType with extension type values is exported identically to ExtensionType with dictionary storage (i.e., extension metadata at the top level). Because almost no extension type (except possibly arrow.opaque, which can support any extension type in theory) actually supports DictionaryType storage (i.e., will error when deserialized), this is a somewhat confusing default.

For now I will probably expose this as an option since I've implemented both deserialization pathways already (and the one that passes the integration tests is the least realistic version).

A related Rust issue is https://github.com/apache/arrow-rs/issues/7982 (where the `DataType` is not capable of representing a DictionaryType with extension value type).
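To make the ambiguity concrete, here is a toy model in plain Python. It is only a sketch of the behavior described above, not Arrow's or nanoarrow's implementation. All class and function names (`Primitive`, `Dictionary`, `Extension`, `to_toy_ipc_field`, `to_toy_c_schema`) are hypothetical. It assumes that an IPC field has one metadata slot alongside its dictionary encoding, whereas a C Data Interface schema carries separate metadata on the parent and on its `dictionary` child.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Primitive:
    name: str


@dataclass(frozen=True)
class Dictionary:
    index: Primitive
    value: object  # Primitive or Extension


@dataclass(frozen=True)
class Extension:
    ext_name: str
    storage: object  # Primitive or Dictionary


def unwrap(t):
    """Return (storage type, extension name or None)."""
    if isinstance(t, Extension):
        return t.storage, t.ext_name
    return t, None


def to_toy_ipc_field(t):
    """Toy IPC field: a single metadata slot shared by the whole field."""
    outer, outer_ext = unwrap(t)
    if isinstance(outer, Dictionary):
        value, value_ext = unwrap(outer.value)
        return {
            "type": value.name,
            "dictionary_index": outer.index.name,
            # Both nestings end up writing to the same top-level slot.
            "custom_metadata": outer_ext or value_ext,
        }
    return {"type": outer.name, "dictionary_index": None, "custom_metadata": outer_ext}


def to_toy_c_schema(t):
    """Toy C schema: parent and dictionary each keep their own metadata."""
    outer, outer_ext = unwrap(t)
    if isinstance(outer, Dictionary):
        value, value_ext = unwrap(outer.value)
        return {
            "format": outer.index.name,
            "metadata": outer_ext,
            "dictionary": {"format": value.name, "metadata": value_ext},
        }
    return {"format": outer.name, "metadata": outer_ext, "dictionary": None}


int32 = Primitive("int32")
fsb16 = Primitive("fixed_size_binary(16)")
ext_of_dict = Extension("example.ext", Dictionary(int32, fsb16))
dict_of_ext = Dictionary(int32, Extension("example.ext", fsb16))

# The two nestings collapse to the same toy IPC field...
print(to_toy_ipc_field(ext_of_dict) == to_toy_ipc_field(dict_of_ext))
# ...but stay distinguishable in the toy C schema.
print(to_toy_c_schema(ext_of_dict) == to_toy_c_schema(dict_of_ext))
```

In this model a reader that sees the toy IPC field has to pick one interpretation, which is the choice I'd expose as an option.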
<details>

```python
import pyarrow as pa
import nanoarrow as na


class DictEncodedExtensionType(pa.ExtensionType):
    def __init__(self):
        storage_type = pa.dictionary(pa.int32(), pa.string())
        super().__init__(storage_type, "example.dict_encoded_ext")

    def __arrow_ext_serialize__(self):
        return b""

    @classmethod
    def __arrow_ext_deserialize__(cls, storage_type, serialized):
        assert isinstance(storage_type, pa.DictionaryType)
        return cls()


# Register once per process
try:
    pa.register_extension_type(DictEncodedExtensionType())
except pa.ArrowKeyError:
    pass

# Create both cases as Arrow types
extension_dictionary_storage = DictEncodedExtensionType()
extension_dictionary_storage
#> DictEncodedExtensionType(DictionaryType(dictionary<values=string, indices=int32, ordered=0>))

dictionary_of_extension = pa.dictionary(pa.int32(), pa.uuid())
dictionary_of_extension
# DictionaryType(dictionary<values=extension<arrow.uuid>, indices=int32, ordered=0>)

na_extension_dictionary_storage = na.c_schema(extension_dictionary_storage)
na_extension_dictionary_storage
# <nanoarrow.c_schema.CSchema example.dict_encoded_ext{dictionary(int32)<string>}>
# - format: 'i'
# - name: ''
# - flags: 2
# - metadata:
#   - b'ARROW:extension:name': b'example.dict_encoded_ext'
#   - b'ARROW:extension:metadata': b''
# - dictionary: <nanoarrow.c_schema.CSchema string>
#   - format: 'u'
#   - name: ''
#   - flags: 2
#   - metadata: NULL
#   - dictionary: NULL
#   - children[0]:
# - children[0]:

na_dictionary_of_extension = na.c_schema(dictionary_of_extension)
na_dictionary_of_extension
# <nanoarrow.c_schema.CSchema dictionary(int32)<arrow.uuid{fixed_size_binary(16)}>>
# - format: 'i'
# - name: ''
# - flags: 2
# - metadata: NULL
# - dictionary: <nanoarrow.c_schema.CSchema arrow.uuid{fixed_size_binary(16)}>
#   - format: 'w:16'
#   - name: ''
#   - flags: 2
#   - metadata:
#     - b'ARROW:extension:name': b'arrow.uuid'
#     - b'ARROW:extension:metadata': b''
#   - dictionary: NULL
#   - children[0]:
# - children[0]:

# Can roundtrip both over C Data Interface
pa.DataType._import_from_c_capsule(na_extension_dictionary_storage.__arrow_c_schema__())
# DictEncodedExtensionType(DictionaryType(dictionary<values=string, indices=int32, ordered=0>))
pa.DataType._import_from_c_capsule(na_dictionary_of_extension.__arrow_c_schema__())
# DictionaryType(dictionary<values=extension<arrow.uuid>, indices=int32, ordered=0>)

schema = pa.schema(
    {
        "extension_dictionary_storage": extension_dictionary_storage,
        "dictionary_of_extension": dictionary_of_extension,
    }
)
schema
# extension_dictionary_storage: extension<example.dict_encoded_ext<DictEncodedExtensionType>>
# dictionary_of_extension: dictionary<values=extension<arrow.uuid>, indices=int32, ordered=0>

schema_bytes = schema.serialize()
with pa.ipc.open_stream(schema_bytes) as s:
    print(s.schema)
# ArrowInvalid: Invalid storage type for UuidType: dictionary<values=fixed_size_binary[16], indices=int32, ordered=0>
```

</details>

### Component(s)

Format, C++

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
