AlienKevin opened a new issue, #43274:
URL: https://github.com/apache/arrow/issues/43274

   ### Describe the bug, including details regarding any error messages, version, and platform.
   
   Based on the 
[doc](https://arrow.apache.org/docs/python/generated/pyarrow.dataset.ParquetFileFragment.html#pyarrow.dataset.ParquetFileFragment.subset),
 `ParquetFileFragment.subset` creates a subset of the fragment and returns a 
`ParquetFileFragment` as output. However, when I then iterate over the returned 
fragment via `to_batches`, the subset filter is not applied. Whether `subset` 
filters eagerly or lazily, I would expect to be able to chain further 
operations such as `to_batches` on it, since the result is still a 
`ParquetFileFragment`.
   
   Here's a snippet that reproduces the behavior described above. You will 
need a `test.parquet` containing a `character` column with values such as 'a', 
'b', 'c', 'd'. After running this script, you should see: `Characters in 
unique_chars but not in charset: d` — and uncommenting the `subset` line does 
not change the result.
   
   ```python
   import pyarrow.parquet as pq
   import pyarrow.compute as pc
   
   charset = ['a', 'b', 'c']
   
   filter = pc.field('character').isin(charset)
   
   dataset = pq.ParquetDataset('test.parquet', filters=filter)
   
   unique_chars = set()
   
   for fragment in dataset.fragments:
       # Uncommenting the next line makes no difference: the subset
       # filter is still not applied by the to_batches call below.
       # fragment = fragment.subset(filter)
       for batch in fragment.to_batches():
           df = batch.to_pandas()
           for character in df['character']:
               unique_chars.add(character)
   
   diff_chars = unique_chars.difference(charset)
   print("Characters in unique_chars but not in charset:", diff_chars)
   ```
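   
   For comparison, passing the filter expression directly to 
`to_batches(filter=...)` does apply row-level filtering. Here's a 
self-contained sketch of that variant — it writes its own throwaway Parquet 
file instead of assuming a `test.parquet` exists:
   
   ```python
   import os
   import tempfile
   
   import pyarrow as pa
   import pyarrow.compute as pc
   import pyarrow.dataset as ds
   import pyarrow.parquet as pq
   
   # Build a throwaway Parquet file with a 'character' column.
   path = os.path.join(tempfile.mkdtemp(), "test.parquet")
   pq.write_table(pa.table({"character": ["a", "b", "c", "d"]}), path)
   
   charset = ["a", "b", "c"]
   expr = pc.field("character").isin(charset)
   
   dataset = ds.dataset(path, format="parquet")
   
   chars = set()
   for fragment in dataset.get_fragments():
       # Passing the expression here filters rows as they are scanned.
       for batch in fragment.to_batches(filter=expr):
           chars.update(batch.column("character").to_pylist())
   
   print(sorted(chars))  # -> ['a', 'b', 'c']
   ```
   
   So the filtering machinery works when the expression is handed to the 
scan itself; it is only the `subset`-then-`to_batches` chaining that does not 
behave as the doc led me to expect.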
   
   ### Component(s)
   
   Parquet, Python

