g0di opened a new issue, #37954:
URL: https://github.com/apache/arrow/issues/37954
### Describe the usage question you have. Please include as many useful
details as possible.
Hello Folks,
I'm currently building a small client on top of PyArrow. I'm running into cases
where I end up adding several filters of the same type on the same column
but with different values. I was wondering whether I need to manually reduce
and optimize those filters before passing them to PyArrow.
For example:
```python
import pyarrow.dataset as ds
import pyarrow.compute as pc
dataset = ds.dataset(
    "path/to/my/dataset",
    partitioning=ds.partitioning(flavor="hive"),
)

# Example 1 - merging two "isin" filters
dataset.filter(pc.field("foo").isin(["a"]) | pc.field("foo").isin(["b"]))
# Versus
dataset.filter(pc.field("foo").isin(["a", "b"]))

# Example 2 - combining equality and containment filters
# (the comparison needs parentheses: | binds tighter than ==)
dataset.filter(pc.field("foo").isin(["a"]) | (pc.field("foo") == "b"))
# Versus
dataset.filter(pc.field("foo").isin(["a", "b"]))

# Example 3 - combining equality filters
dataset.filter((pc.field("foo") == "a") | (pc.field("foo") == "b"))
# Versus
dataset.filter(pc.field("foo").isin(["a", "b"]))
```
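A minimal sketch of what such a manual reduction could look like on the client side, assuming filters are collected as `(column, values)` pairs and that filters on the same column are meant to be combined with OR. The helper names `merge_isin_values` and `to_isin_expression` are hypothetical and not part of PyArrow; only `pc.field(...).isin(...)` is an existing PyArrow call.

```python
from collections import defaultdict


def merge_isin_values(filters):
    """Collapse (column, values) pairs into one de-duplicated value list per column.

    Assumption: all filters on the same column are OR-ed together, so their
    value lists can simply be unioned. First-seen order is preserved.
    """
    merged = defaultdict(list)
    for column, values in filters:
        for value in values:
            if value not in merged[column]:
                merged[column].append(value)
    return dict(merged)


def to_isin_expression(column, values):
    """Build a single PyArrow "isin" expression for one column (hypothetical helper)."""
    # Imported lazily so the merging logic above can be used without PyArrow.
    import pyarrow.compute as pc

    return pc.field(column).isin(values)


# An equality filter such as foo == "b" is represented here as ("foo", ["b"]).
collected = [("foo", ["a"]), ("foo", ["b"]), ("foo", ["a"])]
merged = merge_isin_values(collected)
```

With this, `to_isin_expression("foo", merged["foo"])` produces the single-`isin` form shown in the "Versus" lines above.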
Are these kinds of optimizations necessary? If so, do you know how much they
could affect performance?
>pyarrow 13.0.0
>Python 3.10
>Windows 10 Professional
### Component(s)
Python