jorisvandenbossche opened a new issue, #44068:
URL: https://github.com/apache/arrow/issues/44068

   This issue is to discuss the idea of moving a significant part of the pandas 
conversion and compatibility code to the pandas project itself. Of course we 
would keep all low-level conversions (e.g. everything that lives in C++) at the 
array-level, but a large part of `pandas_compat.py` could live in pandas.
   
   Some reasons to do this:
   - It's a lot of pandas-specific code that might "fit" better in pandas itself
   - It would allow pandas to control the conversion more tightly
     - Example: with the upcoming pandas 3.0 and its new string dtype, pandas 
could ensure that the new dtype is used in every conversion, whereas older 
versions of pyarrow will still return object dtype from `to_pandas()` 
(https://github.com/apache/arrow/issues/43683)
   - The required low-level functionality in pyarrow should now also be stable 
enough to allow having this code live in pandas itself (which might not have 
been the case at the inception of pyarrow)
   
   A potential downside is that it makes the dependency structure even more 
complex (pyarrow's `to_pandas()` relying on pandas relying on pyarrow), 
although we already have infrastructure set up to lazily import pandas.
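
   As a rough illustration of the lazy import pattern mentioned above (this is 
not pyarrow's actual shim; `LazyModule` is a hypothetical name and `fractions` 
is only a stand-in for pandas), the import can be deferred until the first 
attribute access:

```python
import importlib


class LazyModule:
    """Hypothetical helper that imports a module on first attribute access."""

    def __init__(self, name):
        self._name = name
        self._module = None

    @property
    def loaded(self):
        return self._module is not None

    def __getattr__(self, attr):
        # Only reached for attributes that are not found on the instance,
        # i.e. anything that should come from the wrapped module.
        if self._module is None:
            self._module = importlib.import_module(self._name)
        return getattr(self._module, attr)


lazy = LazyModule("fractions")  # stand-in for "pandas"
loaded_before = lazy.loaded     # nothing has been imported yet
half = lazy.Fraction(1, 2)      # first use triggers the import
loaded_after = lazy.loaded
```

   With this kind of deferral, importing either package does not import the other 
one right away, which keeps the circular dependency manageable.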
   
   The idea is not to change any public pyarrow API that supports pandas 
(ingesting pandas objects in various constructors, `to_pandas()` methods on 
objects), but that, at least at the DataFrame and Series level, we would rely 
under the hood on a method from pandas to do that conversion. 
   For example, I think that most of the handling of the "pandas metadata" (to 
guarantee a better pandas <-> arrow roundtrip) could live in pandas itself.
   
   Eventually that would allow us to remove a lot of this pandas compatibility 
code from pyarrow, but note that this is very much a long-term goal, as we will 
need to keep that code around until we drop support for all pandas versions 
older than the version that adds this functionality to pandas.  
   (So that is another downside: in the short term it might increase the 
maintenance effort, because a version of that code would live in two places.)
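
   A minimal sketch of how such a delegation with a fallback could look. All 
names here are hypothetical: `_from_arrow_table` is not an existing pandas 
function, and the `SimpleNamespace` objects merely stand in for older and newer 
pandas versions.

```python
from types import SimpleNamespace


def legacy_table_to_dataframe(table):
    """Stand-in for the conversion code that lives in pyarrow today."""
    return ("legacy", table)


def table_to_dataframe(table, pandas_module):
    # Hypothetical hook name: prefer a conversion provided by pandas,
    # and fall back to pyarrow's own code for older pandas versions.
    hook = getattr(pandas_module, "_from_arrow_table", None)
    if hook is not None:
        return hook(table)
    return legacy_table_to_dataframe(table)


old_pandas = SimpleNamespace()  # no hook available
new_pandas = SimpleNamespace(_from_arrow_table=lambda table: ("pandas", table))

result_old = table_to_dataframe("tbl", old_pandas)
result_new = table_to_dataframe("tbl", new_pandas)
```

   The fallback branch is the part that would have to stay in pyarrow until all 
supported pandas versions provide the hook.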

