paleolimbot opened a new issue, #49704:
URL: https://github.com/apache/arrow/issues/49704

   ### Describe the bug, including details regarding any error messages, version, and platform.
   
   When implementing nanoarrow's IPC reader
(https://github.com/apache/arrow-nanoarrow/pull/861), I was surprised by the
interaction between dictionary encoding and extension types. My personal summary
(which could use checking!) is:
   
   - ExtensionType with dictionary storage roundtrips over the C Data interface 
AND IPC
   - DictionaryType with extension value type roundtrips over the C Data 
interface but NOT IPC
   
   Specifically for IPC, a DictionaryType with extension type values is
exported identically to an ExtensionType with dictionary storage (i.e., with the
extension metadata at the top level). Almost no extension type actually supports
DictionaryType storage (except possibly arrow.opaque, which can in theory wrap
any storage type), so the reader will usually error when deserializing such a
field. This makes for a somewhat confusing default.
   
   For now I will probably expose this as an option since I've implemented both 
deserialization pathways already (and the one that passes the integration tests 
is the least realistic version).
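
To make the ambiguity concrete, here is a toy model in plain Python. It is only a sketch of the behaviour described above, not Arrow's or nanoarrow's implementation: `ToyIpcField`, `encode`, `decode`, and the `extension_applies_to` option are all hypothetical names invented for illustration. The assumption it encodes is the one stated above, namely that the extension name ends up at the top level regardless of where it sat in the type tree.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass(frozen=True)
class Primitive:
    name: str


@dataclass(frozen=True)
class Dictionary:
    index_type: "ToyType"
    value_type: "ToyType"


@dataclass(frozen=True)
class Extension:
    name: str
    storage: "ToyType"


ToyType = Union[Primitive, Dictionary, Extension]


@dataclass(frozen=True)
class ToyIpcField:
    """Hypothetical stand-in for an IPC field with one top-level metadata slot."""
    values: str                    # physical type of the values
    index_type: Optional[str]      # set only for dictionary-encoded fields
    extension_name: Optional[str]  # extension metadata, always at the top level


def encode(t: ToyType) -> ToyIpcField:
    ext_name = None
    if isinstance(t, Extension):
        # Extension with some storage: unwrap and remember the name.
        ext_name, t = t.name, t.storage
    if isinstance(t, Dictionary):
        value = t.value_type
        if isinstance(value, Extension):
            # Dictionary of extension values: the name is hoisted to the same slot.
            ext_name, value = value.name, value.storage
        return ToyIpcField(value.name, t.index_type.name, ext_name)
    return ToyIpcField(t.name, None, ext_name)


def decode(f: ToyIpcField, extension_applies_to: str = "field") -> ToyType:
    base: ToyType = Primitive(f.values)
    if f.index_type is None:
        return Extension(f.extension_name, base) if f.extension_name else base
    index = Primitive(f.index_type)
    if f.extension_name is None:
        return Dictionary(index, base)
    if extension_applies_to == "values":
        # Pathway 2: treat the metadata as describing the dictionary values.
        return Dictionary(index, Extension(f.extension_name, base))
    # Pathway 1: treat the metadata as describing the whole field.
    return Extension(f.extension_name, Dictionary(index, base))


int32, string = Primitive("int32"), Primitive("string")
ext_of_dict = Extension("example.ext", Dictionary(int32, string))
dict_of_ext = Dictionary(int32, Extension("example.ext", string))

# Two different type trees collapse to one encoding...
print(encode(ext_of_dict) == encode(dict_of_ext))
# ...so only the reader-side option decides which tree comes back.
print(decode(encode(dict_of_ext), "field") == ext_of_dict)
print(decode(encode(dict_of_ext), "values") == dict_of_ext)
```

In this model the default (`"field"`) pathway is lossless only for extension types that accept dictionary storage, while the `"values"` pathway recovers the dictionary-of-extension case at the cost of misreading a genuine extension-with-dictionary-storage field.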
   
   A related Rust issue is https://github.com/apache/arrow-rs/issues/7982 
(where the `DataType` is not capable of representing a DictionaryType with 
extension value type).
   
   <details>
   
   ```python
   import pyarrow as pa
   import nanoarrow as na
   
   class DictEncodedExtensionType(pa.ExtensionType):
       def __init__(self):
           storage_type = pa.dictionary(pa.int32(), pa.string())
           super().__init__(storage_type, "example.dict_encoded_ext")
   
       def __arrow_ext_serialize__(self):
           return b""
   
       @classmethod
       def __arrow_ext_deserialize__(cls, storage_type, serialized):
           assert isinstance(storage_type, pa.DictionaryType)
           return cls()
   
   
   # Register once per process
   try:
       pa.register_extension_type(DictEncodedExtensionType())
   except pa.ArrowKeyError:
       pass
   
   # Create both cases as Arrow types
   extension_dictionary_storage = DictEncodedExtensionType()
   extension_dictionary_storage
   # DictEncodedExtensionType(DictionaryType(dictionary<values=string, indices=int32, ordered=0>))
   
   dictionary_of_extension = pa.dictionary(pa.int32(), pa.uuid())
   dictionary_of_extension
   # DictionaryType(dictionary<values=extension<arrow.uuid>, indices=int32, ordered=0>)
   
   na_extension_dictionary_storage = na.c_schema(extension_dictionary_storage)
   na_extension_dictionary_storage
   # <nanoarrow.c_schema.CSchema example.dict_encoded_ext{dictionary(int32)<string>}>
   # - format: 'i'
   # - name: ''
   # - flags: 2
   # - metadata:
   #   - b'ARROW:extension:name': b'example.dict_encoded_ext'
   #   - b'ARROW:extension:metadata': b''
   # - dictionary: <nanoarrow.c_schema.CSchema string>
   #   - format: 'u'
   #   - name: ''
   #   - flags: 2
   #   - metadata: NULL
   #   - dictionary: NULL
   #   - children[0]:
   # - children[0]:
   
   na_dictionary_of_extension = na.c_schema(dictionary_of_extension)
   na_dictionary_of_extension
   # <nanoarrow.c_schema.CSchema dictionary(int32)<arrow.uuid{fixed_size_binary(16)}>>
   # - format: 'i'
   # - name: ''
   # - flags: 2
   # - metadata: NULL
   # - dictionary: <nanoarrow.c_schema.CSchema arrow.uuid{fixed_size_binary(16)}>
   #   - format: 'w:16'
   #   - name: ''
   #   - flags: 2
   #   - metadata:
   #     - b'ARROW:extension:name': b'arrow.uuid'
   #     - b'ARROW:extension:metadata': b''
   #   - dictionary: NULL
   #   - children[0]:
   # - children[0]:
   
   # Can roundtrip both over C Data Interface
   pa.DataType._import_from_c_capsule(na_extension_dictionary_storage.__arrow_c_schema__())
   # DictEncodedExtensionType(DictionaryType(dictionary<values=string, indices=int32, ordered=0>))
   
   pa.DataType._import_from_c_capsule(na_dictionary_of_extension.__arrow_c_schema__())
   # DictionaryType(dictionary<values=extension<arrow.uuid>, indices=int32, ordered=0>)
   
   schema = pa.schema(
       {
           "extension_dictionary_storage": extension_dictionary_storage,
           "dictionary_of_extension": dictionary_of_extension,
       }
   )
   
   schema
   # extension_dictionary_storage: extension<example.dict_encoded_ext<DictEncodedExtensionType>>
   # dictionary_of_extension: dictionary<values=extension<arrow.uuid>, indices=int32, ordered=0>
   
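   # Roundtripping over IPC fails: the extension metadata is written at the top
   # level, so the reader tries to build arrow.uuid with dictionary storage
   schema_bytes = schema.serialize()
   with pa.ipc.open_stream(schema_bytes) as s:
       print(s.schema)
   # ArrowInvalid: Invalid storage type for UuidType: dictionary<values=fixed_size_binary[16], indices=int32, ordered=0>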
   ```
   
   </details>
   
   ### Component(s)
   
   Format, C++


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
