cooldome opened a new issue, #49646:
URL: https://github.com/apache/arrow/issues/49646
### Describe the enhancement requested
The Arrow IPC format assigns a `dict_id` to each dictionary array and
serializes dictionaries first.
Thanks to `dict_id`, the same dictionary can be used by multiple columns.
Currently the IPC writer always assigns a new `dict_id` to every dictionary
it encounters. Hence, while dictionary deduplication is supported by the
format, the user cannot exercise it.
I suggest adding a new option, `dedup_dictionaries`, to `IpcWriteOptions`.
Please opine on the name of the option.
If enabled, the writer will check every dictionary's underlying buffer. If
two dictionaries share the same buffer, the IPC writer will serialize that
dictionary to the file only once and reuse its `dict_id`.
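To make the proposed behavior concrete, here is a minimal sketch in plain Python. It is not the Arrow writer and does not use any Arrow API. The function `assign_dict_ids` is hypothetical, and ordinary Python objects stand in for dictionary buffers, with object identity modeling "the underlying buffer is the same".

```python
from typing import Dict, List, Tuple


def assign_dict_ids(
    columns: List[Tuple[str, object]], dedup_dictionaries: bool
) -> Tuple[Dict[str, int], List[Tuple[int, object]]]:
    """Hypothetical model of dict_id assignment in a writer.

    `columns` pairs a column name with the object standing in for that
    column's dictionary buffer. Returns the dict_id chosen for each column
    and the list of (dict_id, buffer) pairs that would be serialized.
    """
    column_ids: Dict[str, int] = {}
    to_write: List[Tuple[int, object]] = []
    seen: Dict[int, int] = {}  # id(buffer) -> dict_id already assigned
    next_id = 0
    for name, buffer in columns:
        key = id(buffer)  # identity, not content equality
        if dedup_dictionaries and key in seen:
            # Same underlying buffer: reuse the dict_id, write nothing new.
            column_ids[name] = seen[key]
            continue
        column_ids[name] = next_id
        seen[key] = next_id
        to_write.append((next_id, buffer))
        next_id += 1
    return column_ids, to_write


shared = ["red", "green", "blue"]
other = ["red", "green", "blue"]  # equal content, but a different buffer
columns = [("a", shared), ("b", shared), ("c", other)]

ids_off, written_off = assign_dict_ids(columns, dedup_dictionaries=False)
ids_on, written_on = assign_dict_ids(columns, dedup_dictionaries=True)
```

In this sketch, columns `a` and `b` share one `dict_id` only when the option is on, while `c` keeps its own id because its buffer is a different object even though the contents are equal. Comparing by buffer identity keeps the check cheap; deduplicating by content would be a separate design choice.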
I am interested in this feature and can implement it if there are no
objections from the dev community.
I can do the C++ and Python parts.
### Component(s)
C++