bigluck opened a new issue, #593:
URL: https://github.com/apache/iceberg-python/issues/593

   ### Feature Request / Improvement
   
   Ciao all,
   
   I'm looking for a way to work around the limited set of filter expressions supported by PyIceberg without raising an out-of-memory error on my running instance (see #170).
   
   A user, for example, can query by `id = 12` (that's fine, it's supported), but she can also compose a complex query like `id = 12 or LOG(a_number) = 12345` (this filter does not make much sense, but it gives an idea of complex filters).
   
   In this case, I'm forced to transform her query into `id = 12 or NOT a_number IS NULL`, and once I have the resulting Arrow table, I filter it by the user's actual filters.
   
   ```python
   # Coarse scan: push down only the part of the predicate PyIceberg supports
   tmp_res = table.scan(
       row_filter='id = 12 or NOT a_number IS NULL',
       selected_fields=('id', 'a_number')
   ).to_arrow()
   # Post-filter the materialized Arrow table with the user's full predicate
   res = apply_users_filters(
       table=tmp_res,
       filters='id = 12 or LOG(a_number) = 12345',
   )
   ```
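   
   For illustration, here is a minimal sketch of what such a post-filter step could look like with `pyarrow.compute` (assuming the `LOG` in the example maps to the natural log `pc.ln`, and with the example predicate hardcoded rather than parsed from a filter string):
   
   ```python
   import pyarrow as pa
   import pyarrow.compute as pc
   
   def apply_users_filters(table: pa.Table) -> pa.Table:
       # Re-evaluate the full user predicate on the materialized Arrow table:
       #   id = 12 or LOG(a_number) = 12345
       mask = pc.or_(
           pc.equal(table["id"], 12),
           pc.equal(pc.ln(table["a_number"]), 12345),
       )
       return table.filter(mask)
   ```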
   
   But this is risky, especially if the user is scanning a multi-TB table and, for some reason, the columns she's filtering on are strings.
   
   As a temporary workaround, would it be possible to extend the `scan` & `project_table` functions to accept an arbitrary Python callable that is invoked every time a new file is scanned, so unnecessary records are filtered out before the per-file results are merged into the final table? A rough sketch of what I have in mind follows.
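   
   Something along these lines (the `batch_filter` argument is purely hypothetical and does not exist in PyIceberg today; the name and signature are only meant to illustrate the idea):
   
   ```python
   import pyarrow.compute as pc
   
   res = table.scan(
       row_filter='id = 12 or NOT a_number IS NULL',
       selected_fields=('id', 'a_number'),
       # Hypothetical argument: a callable applied to each file's Arrow table
       # before the per-file results are concatenated into the final table.
       batch_filter=lambda t: t.filter(
           pc.or_(pc.equal(t["id"], 12), pc.equal(pc.ln(t["a_number"]), 12345))
       ),
   ).to_arrow()
   ```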


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
