[GitHub] [arrow] g0di opened a new issue, #37954: Python API - Do I need to optimize filters for querying a dataset?

via GitHub Fri, 29 Sep 2023 05:32:12 -0700


g0di opened a new issue, #37954:
URL: https://github.com/apache/arrow/issues/37954


   ### Describe the usage question you have. Please include as many useful details as possible.
   
   
   Hello Folks,
   
   I'm currently building a small client on top of PyArrow. In some cases I
   end up adding several filters of the same type on the same column, each
   with different values. I was wondering whether I need to reduce and
   optimize those filters manually before passing them to PyArrow.
   
   For example
   ```python
   import pyarrow.dataset as ds
   import pyarrow.compute as pc
   
   dataset = ds.dataset("path/to/my/dataset", partitioning=ds.partitioning(flavor="hive"))
   # Example 1 - optimizing "isin"
   dataset.filter(pc.field("foo").isin(["a"]) & pc.field("foo").isin(["b"]))
   # Versus
   dataset.filter(pc.field("foo").isin(["a", "b"]))
   
   # Example 2 - combining equality and containment filters
   dataset.filter(pc.field("foo").isin(["a"]) & (pc.field("foo") == "b"))
   # Versus
   dataset.filter(pc.field("foo").isin(["a", "b"]))
   
   # Example 3 - combining equality filters
   dataset.filter((pc.field("foo") == "a") & (pc.field("foo") == "b"))
   # Versus
   dataset.filter(pc.field("foo").isin(["a", "b"]))
   ```
   
   Are these kinds of optimizations necessary? If so, do you know how much
   they could affect performance?
   
   >pyarrow 13.0.0
   >Python 3.10
   >Windows 10 Professional
   
   
   ### Component(s)
   
   Python


