ypsah opened a new issue, #2037: URL: https://github.com/apache/iceberg-python/issues/2037
### Feature Request / Improvement Hi, apologies in advance for the rather broad scope. I am using `pyiceberg` to store datasets partitioned in two dimensions and am using basic filtering to retrieve specific parts of the space, e.g. `(x = 0 AND y = 0) OR (x = 1 AND y = 1) OR (x = -1 and y = 2)`. Cardinality is relatively modest with a little over 200 partitions and depending on the dataset, data size can range from under a MiB to over 10GiB. I have noticed that when I try to retrieve the whole dataset, but still supply the corresponding filter using only `AND`, `OR` and `=` operators, I pay a rather large overhead, and that overhead gets proportionally better if I factorize the filter using two `IN` operators, i.e. `x in (0, 1, 2, ...) AND y IN (-5, -4, -3, ...)`. For the smallest datasets I manage, the difference is quite noticeable. To load the full dataset: - without filters: 15s - with `x IN (...) AND y IN (...)`: 45s - with `(x = ... AND y = ...) OR (x = ... AND y = ...) OR ...`: 8m30s I am planning on optimizing this on my side as much as possible, by factorizing my filters as much as possible (while still working around https://github.com/apache/iceberg-python/issues/1937 / https://github.com/apache/arrow/issues/46183), but I am wondering if we might be able to optimize things a little bit on `pyiceberg`'s side as well. Most importantly, given my datasets are partitioned on `x` and `y`, little time is effectively spent _applying_ the filter, and it appears that most of the overhead is coming from pre-processing the expression objects themselves. To highlight this, I produced a flamegraph for two runs, one using only the `=` comparison operator and another using `IN` operators instead (both filters covering the same partitions eventually), and using the `search` functionality of the resulting svg, I highlighted the samples that related to the `expression` package of `pyiceberg`: <img width="1867" alt="py-spy record -o profile.svg -- python reproducer-equalto.py" src="https://github.com/user-attachments/assets/fabe479a-9b2a-49fb-963d-2eb123c298bd" /> <img width="934" alt="py-spy record -o profile.svg -- python reproducer-equalto.py" src="https://github.com/user-attachments/assets/b3e99af4-9e82-4055-92d2-89fde8a2cecc" /> Note how the flamegraph for the run that uses `=` spent 83.5% of its CPU time in code related to `expressions`, whereas the run that uses `IN` only spent 6.4% in the same parts of `pyiceberg`. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org