lllangWV opened a new issue, #45113: URL: https://github.com/apache/arrow/issues/45113
### Describe the usage question you have. Please include as many useful details as possible.

**Title:** Improving Deserialization Speed for PyArrow to Python Objects

Hello,

I am working with materials data stored in Parquet files, where a column `structure` contains serialized dictionaries representing structures from the `Structure` class in the `pymatgen` package. This class stores site and lattice information and provides a `.to_dict()` method for serialization.

I have a dataset of ~80,000 structures. To deserialize these into `Structure` objects, I use the following process:

```python
import pyarrow.dataset as ds
from pymatgen.core import Structure

# Use a separate name so the `ds` module is not shadowed by the dataset object.
dataset = ds.dataset(dataset_dir, format="parquet")
table = dataset.to_table(columns=['structure'])
df = table.to_pandas()  # ~8.20 seconds
df['structure_py'] = df['structure'].map(Structure.from_dict)  # ~116 seconds
```

Most of the time is spent mapping the dictionaries to `Structure` objects via `Structure.from_dict`.

I tried using `pa.ExtensionArray` and `pa.ExtensionType` to optimize this step, but performance was similar, because the bottleneck appears to be the `Structure.from_dict` calls themselves. Here is my `ExtensionType` implementation:

```python
import pyarrow as pa


class StructureType(pa.ExtensionType):
    def __init__(self, data_type: pa.DataType):
        if not pa.types.is_struct(data_type):
            raise TypeError(f"data_type must be a struct type, not {data_type}")
        super().__init__(data_type, "matgraphdb.structure")

    def __arrow_ext_serialize__(self) -> bytes:
        # No extra metadata is needed beyond the storage type.
        return b""

    @classmethod
    def __arrow_ext_deserialize__(cls, storage_type, serialized):
        assert pa.types.is_struct(storage_type)
        return StructureType(storage_type)

    def __arrow_ext_class__(self):
        return StructureArray


class StructureArray(pa.ExtensionArray):
    def to_structure(self):
        # Still calls Structure.from_dict once per row.
        return self.storage.to_pandas().map(Structure.from_dict)
```

Despite these efforts, the deserialization time remains substantial.
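To make the bottleneck concrete, here is a minimal sketch of why the wrapper does not help. It is not my actual code: `ToyStructure` and `ToyStructureArray` are hypothetical stand-ins for `Structure` and `StructureArray`, and a plain Python list of made-up dictionaries stands in for the Arrow column, so it runs without `pyarrow` or `pymatgen`. It only counts `from_dict` calls on each path.

```python
from dataclasses import dataclass
from typing import ClassVar


@dataclass
class ToyStructure:
    """Hypothetical stand-in for pymatgen's Structure; not the real class."""
    lattice: dict
    sites: list
    calls: ClassVar[int] = 0  # counts from_dict invocations

    @classmethod
    def from_dict(cls, d):
        cls.calls += 1
        return cls(lattice=d["lattice"], sites=d["sites"])


class ToyStructureArray:
    """Hypothetical wrapper playing the role of StructureArray above."""

    def __init__(self, storage):
        self.storage = storage  # plain list of dicts instead of an Arrow array

    def to_structure(self):
        # The per-row loop is still here; it has only moved into a method.
        return [ToyStructure.from_dict(d) for d in self.storage]


# Made-up rows shaped loosely like the serialized dictionaries.
rows = [{"lattice": {"a": 1.0 + i}, "sites": [{"species": "H"}]} for i in range(1000)]

# Path 1: map from_dict directly over the rows.
ToyStructure.calls = 0
direct = [ToyStructure.from_dict(d) for d in rows]
direct_calls = ToyStructure.calls

# Path 2: go through the wrapper's to_structure() method.
ToyStructure.calls = 0
wrapped = ToyStructureArray(rows).to_structure()
wrapped_calls = ToyStructure.calls

print(direct_calls, wrapped_calls)  # both counts equal len(rows)
```

Both paths make one Python-level `from_dict` call per row, which matches what I observe: the extension type changes where the loop lives, not how much work it does.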
Below is the type of the `structure` column:

```text
struct<
  @class: string,
  @module: string,
  charge: double,
  lattice: struct<a: double, alpha: double, b: double, beta: double, c: double, gamma: double, ...>,
  sites: list<element: struct<...>>
>
```

Is there a recommended approach within PyArrow to speed up deserialization of such complex structured data into Python objects?

Best regards,
Logan Lang

### Component(s)

Parquet, Python