Imbruced commented on PR #2593:
URL: https://github.com/apache/sedona/pull/2593#issuecomment-3908846770

   @Kontinuation 
   https://github.com/apache/sedona/pull/2593#pullrequestreview-3783501453
   
   How this improves on the performance of what we already have:
   - A standard Python UDF transfers data one value at a time, so it suffers 
from Python object serialization.
   - A standard Python UDF with C serialization code still suffers from the 
penalty of sending data over the network.
   - The vectorized UDFs we already have use WKB as the internal transfer 
format, so they suffer from the Sedona-to-WKB and WKB-to-Sedona translations.
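
   To make the costs in the list above concrete, here is a minimal sketch using only the Python standard library. It is not Sedona code: the `(x, y)` tuple is a hypothetical stand-in for a geometry, and `point_to_wkb` / `wkb_to_point` are illustrative helpers that handle only little-endian WKB points.

```python
import pickle
import struct

# Hypothetical stand-in for a geometry column: 2D points as (x, y) tuples.
points = [(float(i), float(i) * 2.0) for i in range(1000)]


def point_to_wkb(x, y):
    # WKB point layout: byte-order flag (1 = little endian),
    # geometry type (1 = Point), then x and y as 8-byte doubles.
    return struct.pack("<BIdd", 1, 1, x, y)


def wkb_to_point(buf):
    byte_order, geom_type, x, y = struct.unpack("<BIdd", buf)
    if byte_order != 1 or geom_type != 1:
        raise ValueError("this sketch only handles little-endian WKB points")
    return (x, y)


# Row-at-a-time transfer: one Python object serialization call per value.
row_payloads = [pickle.dumps(p) for p in points]

# WKB transfer: every value is encoded on one side and decoded on the other,
# even if the receiving code could have worked on the native form directly.
wkb_payloads = [point_to_wkb(x, y) for x, y in points]
decoded = [wkb_to_point(b) for b in wkb_payloads]
```

   In both cases the amount of work grows with the number of values, which is the overhead the bullets describe.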
   
   The current solution mitigates all of those issues:
   - it uses Arrow and sends data in batches
   - it uses the Sedona serde C code to convert the data directly to and from 
Shapely
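
   The batching idea can be sketched as follows, again with the standard library only. This is an assumption-laden illustration, not the PR's implementation: the `array` module stands in for an Arrow column, and `pack_batch`, `unpack_batch`, and `translate_batch` are hypothetical helper names.

```python
from array import array


def pack_batch(points):
    # Store the whole batch as two contiguous double buffers (one per
    # coordinate) instead of one serialized object per value.
    xs = array("d", (x for x, _ in points))
    ys = array("d", (y for _, y in points))
    return xs.tobytes(), ys.tobytes()


def unpack_batch(x_buf, y_buf):
    # Rebuild the coordinate columns from the raw buffers in two calls,
    # regardless of how many values the batch holds.
    xs = array("d")
    xs.frombytes(x_buf)
    ys = array("d")
    ys.frombytes(y_buf)
    return xs, ys


def translate_batch(xs, ys, dx, dy):
    # A batch-level operation: the function is called once per batch,
    # not once per value.
    return array("d", (x + dx for x in xs)), array("d", (y + dy for y in ys))


points = [(float(i), float(i) * 2.0) for i in range(1000)]
x_buf, y_buf = pack_batch(points)
xs, ys = unpack_batch(x_buf, y_buf)
moved_xs, moved_ys = translate_batch(xs, ys, 1.0, -1.0)
```

   The number of transfer and conversion calls here depends on the number of batches rather than the number of values, which is the property the Arrow-based path relies on.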
   
   Instead of SedonaDB, we could use GeoPandas, as Apache Spark already does 
with Pandas. However, we already have SedonaDB in the ecosystem, so why not use 
it? I expect the Python UDFs in SedonaDB will be improved. For now, in the most 
optimized version, we run Shapely over the Arrow arrays, which is more efficient 
than what we already have, to the point where the buffer version is faster than 
the native one we have in Sedona.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
