Fokko commented on code in PR #6775: URL: https://github.com/apache/iceberg/pull/6775#discussion_r1112096974
##########
python/pyiceberg/io/pyarrow.py:
##########
@@ -515,6 +529,14 @@ def _file_to_table(
     if pyarrow_filter is not None:
         arrow_table = arrow_table.filter(pyarrow_filter)
 
+    if len(positional_deletes) > 0:
+        # When there are positional deletes, create a filter mask
+        mask = [True] * len(arrow_table)
+        for buffer in positional_deletes:
+            for pos in buffer:
+                mask[pos.as_py()] = False
+        arrow_table = arrow_table.filter(mask)

Review Comment:
   What do you think of the following:

   ```python
   def generator():
       # Assumes positional_deletes iterates over sorted, unique row positions
       itr = iter(positional_deletes)
       next_delete = next(itr, None)  # None once the deletes are exhausted
       for pos in range(len(arrow_table)):
           if pos == next_delete:
               yield False  # deleted row: filter it out
               next_delete = next(itr, None)
           else:
               yield True  # keep the row

   mask = pa.array(generator(), type=pa.bool_())
   arrow_table = arrow_table.filter(mask)
   ```

   This creates a generator (`range` in Python is lazy as well) that is consumed to materialize the array, and this way we avoid creating large collections on the Python side.
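   The lazy mask idea can be sketched in isolation. This is a minimal, standard-library-only illustration, not the PR's implementation: `keep_mask` and its parameters are hypothetical names, and it assumes each delete buffer holds plain Python ints in ascending order (the buffers in the diff hold Arrow scalars that would first need `.as_py()`). `heapq.merge` is used here to combine several sorted buffers without building a combined list.

   ```python
   import heapq
   from typing import Iterable, Iterator


   def keep_mask(num_rows: int, delete_buffers: Iterable[Iterable[int]]) -> Iterator[bool]:
       """Yield True for rows to keep and False for deleted rows.

       Assumption: every buffer is sorted ascending and contains plain ints.
       """
       # Lazily merge the sorted buffers into one sorted stream of positions
       deletes = heapq.merge(*delete_buffers)
       next_delete = next(deletes, None)  # None once all deletes are consumed
       for pos in range(num_rows):
           # Advance past duplicates and positions that are already behind us
           while next_delete is not None and next_delete < pos:
               next_delete = next(deletes, None)
           yield pos != next_delete


   # Rows 0, 1 and 4 are deleted; position 4 appears in both buffers
   mask = list(keep_mask(6, [[1, 4], [0, 4]]))
   print(mask)  # [False, False, True, True, False, True]
   ```

   In the context of the diff, such a generator could be passed to `pa.array(..., type=pa.bool_())` and the result handed to `arrow_table.filter`, as in the suggestion above.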