lllangWV opened a new issue, #45113:
URL: https://github.com/apache/arrow/issues/45113

   ### Describe the usage question you have. Please include as many useful details as possible.
   
   
   **Title:** Improving Deserialization Speed for PyArrow to Python Objects  
   
   Hello,  
   
   I am working with materials data stored in Parquet files, where a column 
`structure` contains serialized dictionaries representing structures from the 
`Structure` class in the `pymatgen` package. This class stores site and lattice 
information and provides a `.to_dict()` method for serialization.  
   
   I have a dataset of ~80,000 structures. To deserialize these into 
`Structure` objects, I use the following process:  
   
   ```python
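   import pyarrow.dataset as ds
   from pymatgen.core import Structure

   # Bind the dataset to its own name so the `ds` module alias is not shadowed.
   dataset = ds.dataset(dataset_dir, format="parquet")
   table = dataset.to_table(columns=['structure'])
   df = table.to_pandas()  # ~8.20 seconds
   df['structure_py'] = df['structure'].map(Structure.from_dict)  # ~116 seconds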
   ```
   
   Most of the time is spent mapping the dictionaries to `Structure` 
objects via `Structure.from_dict`. I tried using `pa.ExtensionType` and 
`pa.ExtensionArray` to speed this up, but performance was about the same, 
since the bottleneck appears to be the `Structure.from_dict` calls themselves.  
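
A minimal sketch of how the two stages can be timed separately to confirm where the time goes. It does not touch the real data: `FakeStructure`, the synthetic `rows`, and the `timed` helper are hypothetical stand-ins introduced here for illustration, so the numbers it prints say nothing about `pymatgen` itself.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results):
    """Record the wall-clock duration of the enclosed block under `label`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start


class FakeStructure:
    """Hypothetical stand-in for pymatgen's Structure (assumption, not the real class)."""

    def __init__(self, lattice, sites):
        self.lattice = lattice
        self.sites = sites

    @classmethod
    def from_dict(cls, d):
        # Copy each site dict to mimic per-object construction work.
        return cls(d["lattice"], [dict(s) for s in d["sites"]])


# Synthetic rows standing in for the deserialized `structure` column.
rows = [
    {"lattice": {"a": 1.0, "b": 1.0, "c": 1.0}, "sites": [{"x": float(i)}]}
    for i in range(1000)
]

timings = {}
with timed("load", timings):
    loaded = list(rows)  # placeholder for table.to_pandas()
with timed("from_dict", timings):
    structures = [FakeStructure.from_dict(d) for d in loaded]

for label, seconds in timings.items():
    print(f"{label}: {seconds:.6f} s")
```

With the real pipeline, the two `with` blocks would wrap `table.to_pandas()` and the `.map(Structure.from_dict)` call respectively.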
   
   Here's an example of my `ExtensionType` implementation:  
   
   ```python
   import pyarrow as pa
   from pymatgen.core import Structure

   class StructureType(pa.ExtensionType):
       def __init__(self, data_type: pa.DataType):
           if not pa.types.is_struct(data_type):
               raise TypeError(f"data_type must be a struct type, not {data_type}")
           super().__init__(data_type, "matgraphdb.structure")
   
       def __arrow_ext_serialize__(self) -> bytes:
           return b""
   
       @classmethod
       def __arrow_ext_deserialize__(cls, storage_type, serialized):
           assert pa.types.is_struct(storage_type)
           return StructureType(storage_type)
   
       def __arrow_ext_class__(self):
           return StructureArray
   
   class StructureArray(pa.ExtensionArray):
       def to_structure(self):
           return self.storage.to_pandas().map(Structure.from_dict)
   ```
   
   Despite these efforts, the deserialization time remains substantial. Below 
is the type of the `structure` column:  
   
   ```python
   struct<
     @class: string,
     @module: string,
     charge: double,
     lattice: struct<a: double, alpha: double, b: double, beta: double, c: double, gamma: double, ...>,
     sites: list<element: struct<...>>
   >
   ```
   
   Is there a recommended approach within PyArrow to speed up deserialization 
of such complex structured data into Python objects?  
   
   Best regards,  
   Logan Lang  
   
   ### Component(s)
   
   Parquet, Python

