jorisvandenbossche opened a new issue, #44068:
URL: https://github.com/apache/arrow/issues/44068
This issue is to discuss the idea of moving a significant part of the pandas
conversion and compatibility code to the pandas project itself. Of course we
would keep all low-level conversions (e.g. everything that lives in C++) at the
array-level, but a large part of `pandas_compat.py` could live in pandas.
Some reasons to do this:
- It's a lot of pandas-specific code that might "fit" better in pandas itself
- It would allow pandas to control the conversion more tightly
  - Example: with the upcoming pandas 3.0 and its new string dtype, pandas
    could ensure that the new dtype is used in any conversion, whereas
    `to_pandas()` in older versions of pyarrow will still give object dtype
    (https://github.com/apache/arrow/issues/43683)
- The required low-level functionality in pyarrow should now also be stable
enough to allow having this code live in pandas itself (which might not have
been the case at the inception of pyarrow)
A potential downside is that it makes the dependency structure even more
complex (pyarrow's `to_pandas()` relying on pandas relying on pyarrow),
although we already have infrastructure set up to lazily import pandas.
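
To illustrate the lazy-import pattern mentioned above, here is a minimal sketch of a proxy that only imports its target module on first use. This is not pyarrow's actual shim; the `LazyModule` class is a hypothetical name, and the stdlib `json` module stands in for pandas so the sketch runs without either library installed.

```python
import importlib


class LazyModule:
    """Hypothetical proxy that imports the named module on first attribute access."""

    def __init__(self, name):
        self._name = name
        self._module = None

    def _load(self):
        # Import only once, the first time the module is actually needed.
        if self._module is None:
            self._module = importlib.import_module(self._name)
        return self._module

    def __getattr__(self, attr):
        # Called only for names not found on the proxy itself,
        # i.e. for members of the wrapped module.
        return getattr(self._load(), attr)


# `json` stands in for pandas here; nothing is imported by the proxy yet.
lazy_mod = LazyModule("json")
loaded_before = lazy_mod._module is not None

# The first real use triggers the import.
value = lazy_mod.dumps({"a": 1})
loaded_after = lazy_mod._module is not None
```

With a pattern like this, importing one library does not immediately import the other, so a circular relationship only matters at call time, not at import time.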
The idea is not to change any public pyarrow API that supports pandas
(ingesting pandas objects in various constructors, `to_pandas()` methods on
objects), but that, at least at the DataFrame and Series level, we would rely
on a method from pandas under the hood to do that conversion.
For example, I think that most of the handling of the "pandas metadata" (to
guarantee a better pandas <-> arrow roundtrip) could live in pandas itself.
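
As a rough illustration of what that metadata handling involves: the "pandas metadata" is a JSON document stored in the Arrow schema metadata under the `b"pandas"` key. The sketch below uses a hand-written, heavily abbreviated stand-in for that document (real metadata has more keys), and `split_index_and_data` is a hypothetical helper, not an existing pyarrow or pandas function.

```python
import json

# Hand-written, abbreviated stand-in for the schema metadata of a table
# that was created from a DataFrame with an index named "id".
schema_metadata = {
    b"pandas": json.dumps({
        "index_columns": ["id"],
        "columns": [
            {"name": "id", "field_name": "id",
             "pandas_type": "int64", "numpy_type": "int64", "metadata": None},
            {"name": "city", "field_name": "city",
             "pandas_type": "unicode", "numpy_type": "object", "metadata": None},
        ],
    }).encode("utf8")
}


def split_index_and_data(schema_metadata):
    """Hypothetical helper: decide which Arrow fields become the index."""
    meta = json.loads(schema_metadata[b"pandas"].decode("utf8"))
    # Entries may also be dict descriptors (e.g. for a RangeIndex);
    # this sketch only handles the plain field-name form.
    index_fields = [c for c in meta["index_columns"] if isinstance(c, str)]
    data_fields = [c["field_name"] for c in meta["columns"]
                   if c["field_name"] not in index_fields]
    return index_fields, data_fields


index_fields, data_fields = split_index_and_data(schema_metadata)
```

Logic of this kind only needs the schema and its metadata, not the C++ array conversion, which is why it seems like a candidate to live on the pandas side.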
Eventually that would allow us to remove a lot of this pandas compatibility
code from pyarrow, but note that this is very much a long-term goal, as we will
need to keep that code around until we drop support for all pandas versions
older than the version that would add this functionality to pandas.
(So that is another downside: in the short term it might increase the
maintenance effort, because a version of that code would live in two places.)
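
The transition period described above could look roughly like the following sketch: pyarrow delegates to pandas when the installed pandas provides a conversion hook, and otherwise falls back to its own existing code. Everything here is an assumption for illustration: the hook name `convert_arrow_table`, the helper names, and the stand-in "pandas modules" built with `SimpleNamespace`. No such pandas API exists today.

```python
from types import SimpleNamespace


def legacy_table_to_pandas(table):
    """Stand-in for the conversion code that currently lives in pyarrow."""
    return ("converted-by-pyarrow", table)


def table_to_pandas(table, pandas_module):
    # HYPOTHETICAL hook name; a real design would agree on this with pandas.
    hook = getattr(pandas_module, "convert_arrow_table", None)
    if hook is not None:
        # Newer pandas: let pandas control the conversion.
        return hook(table)
    # Older pandas: keep using the copy of the code that lives in pyarrow.
    return legacy_table_to_pandas(table)


# Fake "pandas modules" so the sketch runs without pandas installed.
old_pandas = SimpleNamespace()
new_pandas = SimpleNamespace(
    convert_arrow_table=lambda table: ("converted-by-pandas", table)
)

result_old = table_to_pandas("tbl", old_pandas)
result_new = table_to_pandas("tbl", new_pandas)
```

The fallback branch is the part that would have to stay in pyarrow until all supported pandas versions provide the hook.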