AlienKevin opened a new issue, #43274: URL: https://github.com/apache/arrow/issues/43274
### Describe the bug, including details regarding any error messages, version, and platform.

Based on the [doc](https://arrow.apache.org/docs/python/generated/pyarrow.dataset.ParquetFileFragment.html#pyarrow.dataset.ParquetFileFragment.subset), `ParquetFileFragment.subset` creates a subset of the fragment and returns the same type, `ParquetFileFragment`, as output. However, when I then iterate over the fragment via `to_batches`, I find that the subset filter is not applied. Whether the `subset` filtering is implemented eagerly or lazily, I expect to be able to chain further operations such as `to_batches` after it, since the output type is still a `ParquetFileFragment`.

Here is a snippet that reproduces the behavior. You will need a `test.parquet` containing a `character` column with values such as 'a', 'b', 'c', 'd'. After running this script, you should see: `Characters in unique_chars but not in charset: d`.

```python
import pyarrow.parquet as pq
import pyarrow.compute as pc

charset = ['a', 'b', 'c']
filter = pc.field('character').isin(charset)

dataset = pq.ParquetDataset('test.parquet', filters=filter)

unique_chars = set()
for fragment in dataset.fragments:
    # fragment = fragment.subset(filter)
    for batch in fragment.to_batches():
        df = batch.to_pandas()
        for character in df['character']:
            unique_chars.add(character)

diff_chars = unique_chars.difference(charset)
print("Characters in unique_chars but not in charset:", diff_chars)
```

### Component(s)

Parquet, Python
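
For comparison, here is a minimal sketch of the workaround I am using, assuming `Fragment.to_batches(filter=...)` is the intended way to apply row-level filtering during the scan (my understanding is that `subset` only prunes whole row groups based on Parquet statistics, which would explain why rows like 'd' still come through). Variable names mirror the reproduction above.

```python
import pyarrow.parquet as pq
import pyarrow.compute as pc

charset = ['a', 'b', 'c']
predicate = pc.field('character').isin(charset)

# Same test.parquet as in the reproduction above.
dataset = pq.ParquetDataset('test.parquet', filters=predicate)

unique_chars = set()
for fragment in dataset.fragments:
    # Passing the expression directly to to_batches() applies row-level
    # filtering while scanning, which is the behavior I expected from
    # chaining subset(...).to_batches().
    for batch in fragment.to_batches(filter=predicate):
        unique_chars.update(batch.to_pydict()['character'])

print("Characters read:", unique_chars)  # stays within {'a', 'b', 'c'}
```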