PavloPolovyi opened a new issue, #4271:
URL: https://github.com/apache/arrow-adbc/issues/4271

   ### What would you like help with?
   
   Hey!
   
   We've been using both `adbc_snowflake` and `databricks-adbc` to pull data, 
and we noticed that string columns always come back as plain `Utf8`, no matter 
how repetitive the values are. Even when a 1M-row column has only 10 distinct 
values, we get a flat `Utf8` array rather than `Dictionary<Int32, Utf8>`.
   
   A few questions:
   
   1. **Is this the intended behavior?** We weren't sure whether the warehouse 
ships strings as dictionary-encoded on the wire and the driver flattens them, 
or whether they always come in flat from the source.
   2. **Is there an option we're missing?** Something to either preserve 
dictionary encoding (if it exists upstream), or otherwise get 
`Dictionary<Int32, Utf8>` back from the driver?
   3. **If there isn't, would you be open to adding one?** Something like a 
cardinality threshold — columns where the distinct-to-row-count ratio is below 
some limit get cast to `Dictionary<Int32, Utf8>` before reaching the consumer.
   
   It matters to us because we serialize the data to Arrow IPC and send it over 
a network. Dictionary-encoded strings are a few times smaller on the wire and 
the decoder on the other end has a fast path for that type. Today we walk each 
`RecordBatch` ourselves and cast qualifying `Utf8` columns to dictionaries — it 
works, but it feels like the kind of thing that should live in the driver, 
especially since every consumer doing IPC over a network probably wants the 
same thing.
   
   Happy to share more detail if useful. Thanks!
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
