DanielElisenberg opened a new issue, #46777:
URL: https://github.com/apache/arrow/issues/46777

   ### Describe the bug, including details regarding any error messages, 
version, and platform.
   
   # Severe performance regression in isin() filter After pyarrow v18
   
   ### Introduction
   We use pyarrow at work for statistical analysis of big populations. One of 
the key features we
   use to achieve this is the incredibly fast predicate-pushdown-filters. We 
use these to only retrieve data for our
   current working population.
   We discovered a bug/performance-issue that seems to have been introduced in 
version 18 of pyarrow. This coincides
   with the removal of numpy as a dependency. This seems to affect the 
`isin`-filter which handles arrays, so a link to
   the removal of numpy seems feasible
   
   ### Reproducible example
   I have made a [repository containing a reproducible example 
here](https://github.com/DanielElisenberg/pyarrow-issue-isin).
   It requires you to have python version 3.11.* installed and poetry. Run the 
example with the shell script at the root of the
   repository:
   ```sh
   ./run_tests.sh
   ```
   
   ### Explanation
   Let's say we have a Parquet file that contains an observation for each unit 
in a population. It also constains the
   start and stop time of this observation:
   
   | unit_id    | observation | start | stop |
   | ---------- | ----------- | ----- | ---- |
   | 1          | 23          | 123   | 456  |
   | 2          | 23          | 123   | 456  |
   | ...        | ...         | ...   | ...  |
   | 50_000_000 | 23          | 123   | 456  |
   
   The dataset includes 50 million units, each with a single observation. Now 
let's say I was currently working with
   a population of the 10 million first units. I would then use predicate 
pushdown with the `isin` filter to read only
   the relevant population:
   ```py
   from pyarrow import dataset
   
   parquet_path = "big_dataset"
   population = [unit_id for unit_id in range(1, 10_000_001)]
   population_filter = dataset.field("unit_id").isin(population)
   dataset.dateset(parquet_path).to_table(population_filter)
   ```
   In earlier pyarrow versions, this was blazingly fast, but after v18, it has 
become up to 40 times slower. It is to the point
   where we can't upgrade. Previously, our analytics services could handle 
large volumes of requests per minute, but now they freeze.
   
   ### Stats
   Running the code in [the repository containing a reproducible 
example](https://github.com/DanielElisenberg/pyarrow-issue-isin) yields
   these results on my local machine:
   ```
   Generated in-memory table with 50,000,000 rows. Now writing...
   ✅ Done. Data written to: ../BIG_DATASET
   
   === PYARROW VERSION 15 ===
   Retrieved 10,000,000 rows in 1.71 seconds.
   
   === PYARROW VERSION 16 ===
   Retrieved 10,000,000 rows in 1.67 seconds.
   
   === PYARROW VERSION 17 ===
   Retrieved 10,000,000 rows in 1.66 seconds.
   
   === PYARROW VERSION 18 ===
   Retrieved 10,000,000 rows in 61.83 seconds.
   
   === PYARROW VERSION 19 ===
   Retrieved 10,000,000 rows in 61.09 seconds.
   
   === PYARROW VERSION 20 ===
   Retrieved 10,000,000 rows in 61.53 seconds.
   ```
   
   
   ### Component(s)
   
   Python


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@arrow.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

Reply via email to