aoli-al opened a new issue, #49612:
URL: https://github.com/apache/arrow/issues/49612

   ### Describe the bug, including details regarding any error messages, version, and platform.
   
   Setting `pd.options.mode.string_storage = "pyarrow"` makes repeatedly growing a
   string-typed `DataFrame` with `loc` row assignment several times slower than with
   `"python"` storage (roughly 3.5x to 11.5x in the timings below).
   
   The performance issue largely goes away if I switch to:
   
   ```python
   pd.options.mode.string_storage = "python"
   ```
   
   ## Versions
   
   ```text
   pandas=3.0.1
   pyarrow=23.0.1
   python=3.12
   platform=Linux
   ```
   
   ## Minimal reproducer
   
   ```python
   import time
   
   import pandas as pd
   import pyarrow as pa
   
   
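   def bench(storage: str, rows: int = 1000, cols: int = 20) -> float:
       """Time growing an empty string DataFrame one row at a time via `loc`."""
       # Select the backing storage for pandas string columns.
       pd.options.mode.string_storage = storage
   
       # Source frame of short strings; columns are labeled 0..cols-1.
       source = pd.DataFrame(
           [[f"v{j % 10}" for j in range(cols)] for _ in range(rows)]
       ).astype(str)
       # Empty string-typed frame with the same columns, grown in the loop below.
       out = pd.DataFrame(columns=source.columns).astype(str)
   
       # Time only the row-by-row enlargement, not the setup.
       start = time.perf_counter()
       for i, row in enumerate(source.itertuples(index=False)):
           out.loc[i] = row  # setting a missing label appends a new row
       return time.perf_counter() - start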
   
   
   print(f"pandas={pd.__version__} pyarrow={pa.__version__}")
   for storage in ("python", "pyarrow"):
       elapsed = bench(storage)
       print(storage, elapsed)
   ```
   
   ## Output on my machine
   
   ```text
   pandas=3.0.1 pyarrow=23.0.1 rows=1000 cols=20
   storage=python  array=StringArray       seconds=0.420
   storage=pyarrow array=ArrowStringArray  seconds=3.316
   slowdown(pyarrow/python)=7.89x
   ```
   
   I also see the same pattern with other sizes, for example:
   
   ```text
   500x10:  python=0.147s  pyarrow=0.508s
   500x20:  python=0.200s  pyarrow=0.930s
   1000x10: python=0.292s  pyarrow=1.759s
   1000x20: python=0.411s  pyarrow=3.358s
   1500x20: python=0.624s  pyarrow=7.174s
   ```
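
To make the trend in these timings easier to read, here is a small sketch that computes the pyarrow/python ratio for each size. The numbers are copied from the table above; the names `timings`, `slowdown`, and `ratios` are only illustrative. At a fixed column count the ratio rises with the number of rows, which would be consistent with the per-assignment cost under pyarrow storage growing with the current frame size, though the timings alone do not prove that.

```python
# Timings copied from the table above: (rows, cols) -> (python_s, pyarrow_s).
timings = {
    (500, 10): (0.147, 0.508),
    (500, 20): (0.200, 0.930),
    (1000, 10): (0.292, 1.759),
    (1000, 20): (0.411, 3.358),
    (1500, 20): (0.624, 7.174),
}


def slowdown(python_s: float, pyarrow_s: float) -> float:
    """Return how many times slower the pyarrow run was than the python run."""
    return pyarrow_s / python_s


# Ratio per (rows, cols) shape, in the same order as the table.
ratios = {shape: slowdown(*secs) for shape, secs in timings.items()}

for (rows, cols), ratio in ratios.items():
    print(f"{rows}x{cols}: slowdown={ratio:.2f}x")
```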
   
   
   ### Component(s)
   
   Python

