Fokko commented on PR #6775: URL: https://github.com/apache/iceberg/pull/6775#issuecomment-1460088078
Did another pass:

- Added `_OrderedChunkedArrayConsumer` so we don't have to reallocate and sort into a single array. We could optimize this further by using a tree structure of iterators, but the [ChunkedArray is currently not iterable](https://github.com/apache/arrow/issues/34495), so we would need to wrap it. Also, Python iterators have no `.peek` functionality, so we would have to add [another wrapper for that](https://more-itertools.readthedocs.io/en/stable/api.html#more_itertools.peekable). I expect the number of delete files that affect a data file to be fairly small, so I think we're good here.
- Fixed a rather nasty bug where we first filtered a table and then applied the positional deletes. Filtering changes the row positions, so the deletes were applied to the wrong rows. Instead, when there are positional deletes, I now first read all the data and then filter on the positions. Of course, this breaks the predicate pushdown, and we'll read everything into Arrow buffers.

I think it would be great to merge https://github.com/apache/iceberg/pull/6398 first so we can add some integration tests.
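To illustrate the ordering bug and the sorted-consumer idea, here is a minimal sketch in plain Python. It is not the code from the PR: plain lists stand in for Arrow tables and `ChunkedArray` chunks, and the names `merged_positions`, `apply_positional_deletes`, `delete_then_filter`, and `filter_then_delete` are hypothetical. It assumes each chunk of delete positions is already sorted, and uses the standard library's `heapq.merge` to consume them in order without concatenating and re-sorting.

```python
import heapq
from typing import Callable, Iterable, Iterator, List, Optional, Sequence


def merged_positions(chunks: Iterable[Sequence[int]]) -> Iterator[int]:
    """Lazily merge already-sorted chunks of delete positions into one sorted stream."""
    # heapq.merge tracks the head of every input itself, so no peek wrapper is needed here.
    return heapq.merge(*chunks)


def apply_positional_deletes(rows: Sequence[int], chunks: Iterable[Sequence[int]]) -> List[int]:
    """Drop the rows whose position in `rows` appears in any of the delete chunks."""
    deletes = merged_positions(chunks)
    next_delete: Optional[int] = next(deletes, None)
    kept: List[int] = []
    for pos, row in enumerate(rows):
        # Advance past delete positions we have already handled (including duplicates).
        while next_delete is not None and next_delete < pos:
            next_delete = next(deletes, None)
        if next_delete == pos:
            continue  # this row is deleted
        kept.append(row)
    return kept


def delete_then_filter(rows, chunks, predicate: Callable[[int], bool]) -> List[int]:
    """Correct order: positions still refer to the original file layout."""
    return [r for r in apply_positional_deletes(rows, chunks) if predicate(r)]


def filter_then_delete(rows, chunks, predicate: Callable[[int], bool]) -> List[int]:
    """Buggy order: filtering shifts the positions, so the wrong rows get dropped."""
    return apply_positional_deletes([r for r in rows if predicate(r)], chunks)


# Hypothetical data file with six rows and two delete files (positions 0, 1 and 4).
rows = [10, 11, 12, 13, 14, 15]
delete_chunks = [[1, 4], [0, 4]]


def keep(row: int) -> bool:
    return row >= 12


correct = delete_then_filter(rows, delete_chunks, keep)  # [12, 13, 15]
buggy = filter_then_delete(rows, delete_chunks, keep)    # [14, 15]
```

In the buggy order, the filter removes rows 10 and 11 first, so positions 0 and 1 now point at rows 12 and 13, which are deleted by mistake, while row 14 (original position 4) survives. The cost of the correct order is the one noted above: the predicate can no longer be pushed down to the reader.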