andrewthad opened a new issue, #47335:
URL: https://github.com/apache/arrow/issues/47335
### Describe the usage question you have. Please include as many useful details as possible.
I am using the `Map` type with both the keys and the values as strings. I'm
able to save a considerable amount of space by using dictionaries to compress
the keys since the set of all possible keys is very small. Arrow appears to
support this. For example, suppose we have three people and some arbitrary
metadata about them:
```
>>> import pyarrow as pa
>>> import pyarrow.compute as pc
>>> pa.__version__
'20.0.0'
>>> payload = [
...     [{'key': 'FirstName', 'value': 'Martha'}, {'key': 'LastName', 'value': 'Washington'}],
...     [{'key': 'FirstName', 'value': 'Cynthia'}],
...     [{'key': 'FirstName', 'value': 'Steve'}, {'key': 'LastName', 'value': 'Martin'}],
... ]
>>> table = pa.table(
...     [
...         pa.array(payload, type=pa.map_(pa.dictionary(pa.int32(), pa.string()), pa.string(), keys_sorted=True)),
...         pa.array([50, 67, 300], pa.int32()),
...     ],
...     names=['attributes', 'identifier'],
... )
>>> table
pyarrow.Table
attributes: map<dictionary<values=string, indices=int32, ordered=0>, string, keys_sorted>
  child 0, entries: struct<key: dictionary<values=string, indices=int32, ordered=0> not null, value: string> not null
      child 0, key: dictionary<values=string, indices=int32, ordered=0> not null
      child 1, value: string
identifier: int32
----
attributes: [[keys: -- dictionary:
["FirstName","LastName"] -- indices:
[0,1]values:["Martha","Washington"],keys: -- dictionary:
["FirstName","LastName"] -- indices:
[0]values:["Cynthia"],keys: -- dictionary:
["FirstName","LastName"] -- indices:
[0,1]values:["Steve","Martin"]]]
identifier: [[50,67,300]]
```
This is correct so far. The schema looks exactly like I would expect it to.
The way the data is printed at the bottom is horribly mangled, but it doesn't
really matter since the data is all correct.
However, `map_lookup` can no longer be used (it works fine if we do not use
a dictionary for the keys):
```
>>> pc.map_lookup(table.column('attributes'), 'FirstName', 'first')
Traceback (most recent call last):
... suppressed by me ...
pyarrow.lib.ArrowTypeError: map_lookup: query_key type and Map key_type don't match. Expected type: dictionary<values=string, indices=int32, ordered=0>, but got type: string
```
Let's try giving it a scalar string explicitly:
```
>>> pc.map_lookup(table.column('attributes'), pa.scalar('FirstName', pa.string()), 'first')
Traceback (most recent call last):
... suppressed by me ...
pyarrow.lib.ArrowTypeError: map_lookup: query_key type and Map key_type don't match. Expected type: dictionary<values=string, indices=int32, ordered=0>, but got type: string
```
That is no better. Just to be safe, we can try using a dictionary-typed scalar
as the key (not that it really makes sense for a query key to be
dictionary-encoded):
```
>>> pc.map_lookup(table.column('attributes'), pa.scalar('FirstName', pa.dictionary(pa.int32(), pa.string())), 'first')
Traceback (most recent call last):
... suppressed by me ...
pyarrow.lib.ArrowTypeError: Got unsupported type: dictionary<values=string, indices=int32, ordered=0>
```
The error message is different, but it is what I would expect: it says that a
dictionary-typed query key is not supported, which seems correct. It also
suggests that the earlier error message is misleading, since it says "Expected
type: dictionary<values=string, indices=int32, ordered=0>", yet a scalar of
exactly that type is rejected.
So my question is whether there is any way to perform this lookup, and more
fundamentally, whether dictionary compression is intended to be applied to map
keys like this at all.
### Component(s)
Python