cooldome opened a new issue, #49646:
URL: https://github.com/apache/arrow/issues/49646

   ### Describe the enhancement requested
   
   The Arrow IPC format assigns a `dict_id` to each dictionary and serializes
dictionaries before the record batches that use them.
   Thanks to `dict_id`, the same dictionary can be shared by multiple
columns.
   
   Currently the IPC writer always assigns a new `dict_id` to every dictionary
it encounters. Hence, although the format supports dictionary deduplication,
users cannot exercise it.
   
   I suggest adding a new option, `dedup_dictionaries`, to `IpcWriteOptions`.
Please opine on the name of the option.
   If enabled, the writer will check every dictionary's buffer; if the
underlying buffer is the same, the IPC writer will serialize the unique
dictionary to the file only once by reusing its `dict_id`.
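   
   A minimal sketch of the proposed check, written in plain Python rather than
against the actual Arrow C++ writer. The function name `assign_dict_ids` and
the `(column_name, dictionary)` input shape are hypothetical, and Python's
`id()` stands in for comparing the addresses of the underlying buffers:

```python
def assign_dict_ids(columns):
    """Assign one dict_id per distinct dictionary object (illustrative only).

    `columns` is a list of (column_name, dictionary) pairs. Two columns get
    the same id only when they reference the very same dictionary object,
    mirroring a "same underlying buffer" check; equal contents alone do not
    count.
    """
    ids_by_identity = {}  # id(dictionary) -> assigned dict_id
    assignment = {}       # column_name -> dict_id
    for name, dictionary in columns:
        key = id(dictionary)
        if key not in ids_by_identity:
            # First time this buffer is seen: allocate the next dict_id.
            ids_by_identity[key] = len(ids_by_identity)
        assignment[name] = ids_by_identity[key]
    return assignment


shared = ["red", "green", "blue"]
# "a" and "b" share one dictionary object; "c" has an equal but separate copy.
columns = [("a", shared), ("b", shared), ("c", ["red", "green", "blue"])]
print(assign_dict_ids(columns))  # {'a': 0, 'b': 0, 'c': 1}
```

   Comparing by identity keeps the check cheap. Deduplicating dictionaries
that are equal in content but stored in different buffers would need a content
comparison or hash, which this proposal does not ask for.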
   
   I am interested in this feature and can implement it if there are no
objections from the dev community.
   
   I can do the C++ and Python parts.
   
   
   
    
   
   ### Component(s)
   
   C++


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.