bigluck opened a new issue, #593: URL: https://github.com/apache/iceberg-python/issues/593
### Feature Request / Improvement

Ciao all,

I'm looking for a way to work around the limited set of filter expressions supported by PyIceberg without raising an out-of-memory error on my running instance (see #170).

A user, for example, can query by `id = 12` (that's fine, it's supported), but she can also compose a complex query like `id = 12 or LOG(a_number) = 12345` (this filter does not make much sense, but it gives an idea of what a complex filter looks like). In that case I'm forced to rewrite her query as `id = 12 or NOT a_number IS NULL`, and once I have the final Arrow table, I filter it by the user's actual filters:

```python
tmp_res = table.scan(
    row_filter='id = 12 or NOT a_number IS NULL',
    selected_fields=('id', 'a_number'),
).to_arrow()

res = apply_users_filters(
    table=tmp_res,
    filters='id = 12 or LOG(a_number) = 12345',
)
```

But this is risky, especially if the user is scanning a multi-TB table and, for some reason, the columns she's filtering on are strings.

As a temporary workaround, would it be possible to extend the `scan` & `project_table` functions to accept an arbitrary Python lambda, invoked every time a new file is scanned, to filter out all the unnecessary records before they are merged into the final table?
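For reference, `apply_users_filters` in the snippet above is my own helper, not a PyIceberg API. A minimal sketch of what it does, assuming the user's predicate is hard-coded with `pyarrow.compute` instead of being parsed from the filter string:

```python
import pyarrow as pa
import pyarrow.compute as pc


def apply_users_filters(table: pa.Table, *, filters: str) -> pa.Table:
    # Sketch only: a real implementation would parse `filters` into an
    # expression; here the example predicate
    # `id = 12 or LOG(a_number) = 12345` is hard-coded.
    mask = pc.or_(
        pc.equal(table["id"], 12),
        pc.equal(pc.ln(table["a_number"]), 12345),
    )
    return table.filter(mask)
```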
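And a rough sketch of the API I have in mind; the `per_file_filter` parameter is purely hypothetical and does not exist in PyIceberg today. The callable would receive the Arrow data of each scanned file and return only the matching rows, so unnecessary records are dropped before the per-file tables are concatenated:

```python
res = table.scan(
    row_filter='id = 12 or NOT a_number IS NULL',
    selected_fields=('id', 'a_number'),
    per_file_filter=lambda file_table: apply_users_filters(  # hypothetical parameter
        table=file_table,
        filters='id = 12 or LOG(a_number) = 12345',
    ),
).to_arrow()
```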