ypsah opened a new issue, #2037:
URL: https://github.com/apache/iceberg-python/issues/2037

   ### Feature Request / Improvement
   
   Hi, apologies in advance for the rather broad scope.
   
   I am using `pyiceberg` to store datasets partitioned in two dimensions, and 
I use basic filtering to retrieve specific parts of the space, e.g. `(x = 0 
AND y = 0) OR (x = 1 AND y = 1) OR (x = -1 AND y = 2)`. Cardinality is 
relatively modest, at a little over 200 partitions, and depending on the 
dataset, data size ranges from under 1 MiB to over 10 GiB.
   
   I have noticed that when I retrieve the whole dataset but still supply the 
corresponding filter using only `AND`, `OR` and `=` operators, I pay a rather 
large overhead, and that this overhead shrinks substantially if I factorize 
the filter into two `IN` operators, i.e. `x IN (0, 1, 2, ...) AND y IN (-5, 
-4, -3, ...)`.
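
To make the two filter shapes concrete, here is a minimal sketch in plain Python that builds both strings from a list of `(x, y)` partition pairs. The helper names (`or_of_ands`, `in_factorized`, `is_full_grid`) are hypothetical and are not part of `pyiceberg`; the strings simply follow the notation used above. Note that the two-`IN` form selects the full Cartesian product of the `x` and `y` values, so it matches the pairwise filter exactly only when the requested pairs form a complete grid (as they do when the whole dataset is requested).

```python
from itertools import product


def or_of_ands(pairs):
    """One `(x = a AND y = b)` term per requested partition, joined with OR."""
    return " OR ".join(f"(x = {x} AND y = {y})" for x, y in pairs)


def in_factorized(pairs):
    """Two IN lists; selects every combination of the listed x and y values."""
    xs = sorted({x for x, _ in pairs})
    ys = sorted({y for _, y in pairs})
    x_list = ", ".join(map(str, xs))
    y_list = ", ".join(map(str, ys))
    return f"x IN ({x_list}) AND y IN ({y_list})"


def is_full_grid(pairs):
    """True when the pairs are exactly the Cartesian product of their x and y values."""
    xs = {x for x, _ in pairs}
    ys = {y for _, y in pairs}
    return set(pairs) == set(product(xs, ys))


# Illustrative values taken from the example filter above.
pairs = [(0, 0), (1, 1), (-1, 2)]
print(or_of_ands(pairs))
print(in_factorized(pairs))
print(is_full_grid(pairs))  # the IN form is a superset here, not an equivalent
```

When `is_full_grid` returns `False`, the factorized string would read more partitions than requested, so it is only a drop-in replacement for grid-shaped selections.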
   
   For the smallest datasets I manage, the difference is quite noticeable. To 
load the full dataset:
   - without filters: 15s
   - with `x IN (...) AND y IN (...)`: 45s
   - with `(x = ... AND y = ...) OR (x = ... AND y = ...) OR ...`: 8m30s
   
   I am planning to optimize this on my side by factorizing my filters as far 
as I can (while still working around 
https://github.com/apache/iceberg-python/issues/1937 / 
https://github.com/apache/arrow/issues/46183), but I am wondering whether 
things could also be optimized a little on `pyiceberg`'s side.
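
For selections that are not a complete grid, one possible client-side factorization is to group the requested pairs by `x` and emit one `y IN (...)` list per group. This keeps the filter exact while reducing the number of terms. The sketch below is plain Python with a hypothetical helper name (`group_by_x`); it has not been measured against `pyiceberg`, and it does not account for the workarounds needed for the issues linked above.

```python
from collections import defaultdict


def group_by_x(pairs):
    """Exact factorization: one `x = a AND y IN (...)` term per distinct x."""
    ys_by_x = defaultdict(set)
    for x, y in pairs:
        ys_by_x[x].add(y)

    terms = []
    for x in sorted(ys_by_x):
        ys = sorted(ys_by_x[x])
        if len(ys) == 1:
            # A single y value needs no IN list.
            terms.append(f"(x = {x} AND y = {ys[0]})")
        else:
            y_list = ", ".join(map(str, ys))
            terms.append(f"(x = {x} AND y IN ({y_list}))")
    return " OR ".join(terms)


# Illustrative values: five partitions collapse into three OR terms.
pairs = [(0, 0), (0, 1), (0, 2), (1, 1), (-1, 2)]
print(group_by_x(pairs))
```

Grouping by whichever dimension has fewer distinct values would give the shortest filter; grouping by `x` here is an arbitrary choice.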
   
   Most importantly, given my datasets are partitioned on `x` and `y`, little 
time is effectively spent _applying_ the filter, and it appears that most of 
the overhead is coming from pre-processing the expression objects themselves.
   
   To highlight this, I produced a flamegraph for each of two runs, one using 
only the `=` comparison operator and the other using `IN` operators (both 
filters ultimately cover the same partitions). Using the search functionality 
of the resulting SVG, I highlighted the samples related to the `expressions` 
package of `pyiceberg`:
   
   <img width="1867" alt="py-spy record -o profile.svg -- python 
reproducer-equalto.py" 
src="https://github.com/user-attachments/assets/fabe479a-9b2a-49fb-963d-2eb123c298bd"
 />
   
   <img width="934" alt="py-spy record -o profile.svg -- python 
reproducer-equalto.py" 
src="https://github.com/user-attachments/assets/b3e99af4-9e82-4055-92d2-89fde8a2cecc"
 />
   
   Note how the flamegraph for the run that uses `=` spent 83.5% of its CPU 
time in code related to `expressions`, whereas the run that uses `IN` only 
spent 6.4% in the same parts of `pyiceberg`.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

