andrewthad opened a new issue, #47335:
URL: https://github.com/apache/arrow/issues/47335

   ### Describe the usage question you have. Please include as many useful details as possible.
   
   
   I am using the `Map` type with both the keys and the values as strings. I'm 
able to save a considerable amount of space by using dictionaries to compress 
the keys since the set of all possible keys is very small. Arrow appears to 
support this. For example, suppose we have three people and some arbitrary 
metadata about them:
   
   ```
   >>> import pyarrow as pa
   >>> import pyarrow.compute as pc
   >>> pa.__version__
   '20.0.0'
   >>> payload = [
   ...     [{'key': 'FirstName', 'value': 'Martha'}, {'key': 'LastName', 'value': 'Washington'}],
   ...     [{'key': 'FirstName', 'value': 'Cynthia'}],
   ...     [{'key': 'FirstName', 'value': 'Steve'}, {'key': 'LastName', 'value': 'Martin'}],
   ... ]
   >>> table = pa.table(
   ...     [
   ...         pa.array(payload, type=pa.map_(pa.dictionary(pa.int32(), pa.string()), pa.string(), keys_sorted=True)),
   ...         pa.array([50, 67, 300], pa.int32()),
   ...     ],
   ...     names=['attributes', 'identifier'],
   ... )
   >>> table
   pyarrow.Table
   attributes: map<dictionary<values=string, indices=int32, ordered=0>, string, 
keys_sorted>
     child 0, entries: struct<key: dictionary<values=string, indices=int32, 
ordered=0> not null, value: string> not null
         child 0, key: dictionary<values=string, indices=int32, ordered=0> not 
null
         child 1, value: string
   identifier: int32
   ----
   attributes: [[keys:    -- dictionary:
   ["FirstName","LastName"]    -- indices:
   [0,1]values:["Martha","Washington"],keys:    -- dictionary:
   ["FirstName","LastName"]    -- indices:
   [0]values:["Cynthia"],keys:    -- dictionary:
   ["FirstName","LastName"]    -- indices:
   [0,1]values:["Steve","Martin"]]]
   identifier: [[50,67,300]]
   ```
   
   This is correct so far. The schema looks exactly as I would expect. The 
data printed at the bottom is badly mangled, but that doesn't really matter 
since the data itself is correct.
   
   However, `map_lookup` can no longer be used (it works fine if we do not use 
a dictionary for the keys):
   
   ```
   >>> pc.map_lookup(table.column('attributes'), 'FirstName', 'first')
   Traceback (most recent call last):
   ... suppressed by me ...
   pyarrow.lib.ArrowTypeError: map_lookup: query_key type and Map key_type 
don't match. Expected type: dictionary<values=string, indices=int32, 
ordered=0>, but got type: string
   ```
   
   Let's try giving it a scalar string explicitly:
   
   ```
   >>> pc.map_lookup(table.column('attributes'), pa.scalar('FirstName', pa.string()), 'first')
   Traceback (most recent call last):
   ... suppressed by me ...
   pyarrow.lib.ArrowTypeError: map_lookup: query_key type and Map key_type 
don't match. Expected type: dictionary<values=string, indices=int32, 
ordered=0>, but got type: string
   ```
   
   Not any better. Just to be safe, we can try a dictionary-typed scalar as the 
key (not that it really makes sense for a query to do this):
   
   ```
   >>> pc.map_lookup(table.column('attributes'), pa.scalar('FirstName', pa.dictionary(pa.int32(), pa.string())), 'first')
   Traceback (most recent call last):
   ... suppressed by me ...
   pyarrow.lib.ArrowTypeError: Got unsupported type: dictionary<values=string, 
indices=int32, ordered=0>
   ```
   
   A different error message, but it's what I would expect. It just says that 
it doesn't make any sense to do this, which is correct. This suggests that the 
earlier error message is misleading: it says "Expected type: 
dictionary<values=string, indices=int32, ordered=0>", but a scalar of exactly 
that type is rejected.
   
   So my question is whether there is any way to perform this operation, or 
whether dictionary compression is even intended to be applied to map keys 
like this.
   
   ### Component(s)
   
   Python

