petern48 commented on PR #2038:
URL: https://github.com/apache/sedona/pull/2038#issuecomment-3028966983
@zhangfengcdt I think we're mostly on the same page, actually. The Spark
index column (`__index_level_{}__`) I'm using does represent the index
in geopandas. See the comment from the PySpark codebase
([here](https://github.com/apache/spark/blob/master/python/pyspark/pandas/internal.py)),
quoted below:
```python
# A function to turn given numbers to Spark columns that represent
# pandas-on-Spark index.
SPARK_INDEX_NAME_FORMAT = "__index_level_{}__".format
SPARK_DEFAULT_INDEX_NAME = SPARK_INDEX_NAME_FORMAT(0)
```
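As a quick illustration of how that format function produces column names, here is a small standalone sketch. The two constants are copied from the snippet above; `index_column_names` is a hypothetical helper I made up for this example, not something in PySpark or Sedona.

```python
# Same definitions as in the PySpark snippet quoted above.
SPARK_INDEX_NAME_FORMAT = "__index_level_{}__".format
SPARK_DEFAULT_INDEX_NAME = SPARK_INDEX_NAME_FORMAT(0)


def index_column_names(num_levels):
    """Hypothetical helper: Spark column names for each index level."""
    return [SPARK_INDEX_NAME_FORMAT(level) for level in range(num_levels)]


print(SPARK_DEFAULT_INDEX_NAME)  # __index_level_0__
print(index_column_names(2))     # ['__index_level_0__', '__index_level_1__']
```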
> However, if no index is used in the GeoSeries creation, then we don't need
> to support alignment

If no index is given, pandas-on-Spark creates a default index, which we can
use for `align=True`. This is what the current tests rely on, since we don't
have index support yet.
Originally, I was proposing not to support `align=False`, where geopandas
uses the "natural ordering" of the series instead of the given index. However,
it looks like pandas-on-Spark already has a [hidden natural ordering
column](https://github.com/apache/spark/blob/a1e628574b7d9cdf89472fa550ecc41f8a871b98/python/pyspark/pandas/internal.py#L77-L79),
so we can try using that.
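To make the difference between the two modes concrete, here is a rough pure-Python sketch. It is not Sedona or pandas-on-Spark code: `pair_rows` and the sample data are made up for illustration. It assumes unique index labels, and that `align=False` only makes sense for inputs of equal length.

```python
def pair_rows(left, right, align=True):
    """Hypothetical helper pairing two series before a binary operation.

    `left` and `right` are lists of (index_label, value) tuples in row order.
    Returns a list of (index_label, left_value, right_value) tuples.
    """
    if align:
        # align=True: match rows by index label (outer join on the index).
        # Labels present on only one side are paired with None.
        left_by_label, right_by_label = dict(left), dict(right)
        labels = sorted(set(left_by_label) | set(right_by_label))
        return [
            (label, left_by_label.get(label), right_by_label.get(label))
            for label in labels
        ]

    # align=False: ignore the labels and match rows by position
    # (the "natural ordering"), keeping the left-hand labels.
    if len(left) != len(right):
        raise ValueError("align=False requires inputs of equal length")
    return [
        (l_label, l_value, r_value)
        for (l_label, l_value), (_, r_value) in zip(left, right)
    ]


left = [(0, "A"), (1, "B"), (2, "C")]
right = [(1, "x"), (2, "y"), (3, "z")]

aligned = pair_rows(left, right, align=True)
positional = pair_rows(left, right, align=False)

print(aligned)     # [(0, 'A', None), (1, 'B', 'x'), (2, 'C', 'y'), (3, None, 'z')]
print(positional)  # [(0, 'A', 'x'), (1, 'B', 'y'), (2, 'C', 'z')]
```

With a default index on both sides, the labels and the positions coincide, so the two modes give the same pairing. They only diverge once a user-supplied index is involved.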
Regardless, if the current default `align=True` logic sounds good to you,
I'd rather merge this in now and revisit additional functionality
(`align=False`) later when we add indexes (creating a separate issue of
course). Does that make sense?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]