Imbruced commented on PR #2593: URL: https://github.com/apache/sedona/pull/2593#issuecomment-3908846770
@Kontinuation https://github.com/apache/sedona/pull/2593#pullrequestreview-3783501453

How does it improve on the performance of what we already have?

- The standard Python UDF transfers data one row at a time, so it suffers from Python object serialization overhead.
- The standard Python UDF with C serialization code suffers from the penalty of sending data over the network.
- The vectorized UDFs we already have use WKB as the internal transfer format, so they suffer from the WKB-to-Sedona and Sedona-to-WKB translations.

The current solution mitigates all of those issues:

- It uses Arrow and sends data in batches.
- It uses the Sedona serde C code to convert the data directly to and from Shapely.

Instead of SedonaDB, we could use GeoPandas, as Apache Spark already does with Pandas. However, we already have SedonaDB in the ecosystem, so why not use it? I guess the Python UDFs in SedonaDB will be improved; for now, in the most optimized version, we run Shapely over the Arrow arrays, which is more efficient than what we already have, to the point where the buffer version is faster than the native one we have in Sedona.

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at: [email protected]
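To make the row-at-a-time versus batched comparison in the comment above concrete, here is a minimal sketch, not the code from this PR. It assumes Shapely >= 2.0 and NumPy, and the helper names `buffer_row_by_row` and `buffer_batch` are hypothetical. It also still uses WKB as the interchange format purely for illustration; the PR's approach of using the Sedona serde C code to skip that translation is not shown here.

```python
import numpy as np
import shapely  # assumes Shapely >= 2.0 for the vectorized functions
from shapely import wkb


def buffer_row_by_row(wkb_values, distance):
    """Row-at-a-time path: one Python-level decode/buffer/encode per value."""
    out = []
    for value in wkb_values:
        geom = wkb.loads(value)
        # quad_segs is set explicitly so both paths use the same approximation.
        out.append(geom.buffer(distance, quad_segs=8).wkb)
    return out


def buffer_batch(wkb_values, distance):
    """Batched path: decode, buffer, and encode whole arrays at once."""
    geoms = shapely.from_wkb(wkb_values)
    buffered = shapely.buffer(geoms, distance, quad_segs=8)
    return shapely.to_wkb(buffered)


# Hypothetical input: a small batch of points serialized as WKB.
coords = np.column_stack([np.arange(5.0), np.arange(5.0)])
batch = shapely.to_wkb(shapely.points(coords))

row_result = buffer_row_by_row(batch, 1.0)
batch_result = buffer_batch(batch, 1.0)
```

Both functions should produce equivalent geometries; the difference is that the batched version crosses the Python/C boundary once per array instead of once per row, which is the effect the Arrow-based UDF path relies on.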
