[I] How to project and filter a Dataset simultaneously (out-of-core)? [arrow]

via GitHub Wed, 11 Jun 2025 08:07:01 -0700


JosephWagner opened a new issue, #46781:
URL: https://github.com/apache/arrow/issues/46781


   ### Describe the usage question you have. Please include as many useful 
details as  possible.
   
   
   Hi all,
   
   I'm uncertain how to filter a Dataset.scanner using projected/transformed 
columns that are not present in the underlying data. I know that I can create a 
table as an intermediate step, but I am working with data larger than memory so 
I'd like to keep things out-of-core if possible. What is the idiomatic way to 
do this?
   
   My best attempt so far is to separate the projection and filtering into 2 
separate actions. Am I correct in thinking that creating a projected 
RecordBatchReader and then a scanner from that will result in out-of-core 
projection and filtering?
   
   ```
   from pyarrow import compute as pc
   import pyarrow as pa
   import pyarrow.parquet as pq
   import pyarrow.dataset as ds
   
   table = pa.table({'year': [2019, 2020, 2021]})
   pq.write_table(table, "dataset_scanner.parquet")
   dataset = ds.dataset("dataset_scanner.parquet")
   
   print("filtered dataset year+1> 2020:")
   print("attempt chained scanning to compute year_plus_one first")
   projected = dataset.scanner(columns={"year": ds.field("year"), 
"year_plus_one": pc.add(ds.field("year"), 1)}).to_reader()
   print(ds.Scanner.from_batches(projected, filter=ds.field("year_plus_one") > 
2020).to_table())
   
   ##print("Filtered dataset year+1> 2020:")
   ##print("fails because year_plus_one does not exist in schema to filter 
with")
   ##print(dataset.scanner(columns={"year": ds.field("year"), "year_plus_one": 
pc.add(ds.field("year"), 1)}, filter=ds.field("year_plus_one") > 
2020).to_table())
   ```
   
   My first attempt (commented out above) tried to do everything in one 
operation and it returns this error:
   ```
   File "pyarrow/_dataset.pyx", line 415, in pyarrow._dataset.Dataset.scanner
     File "pyarrow/_dataset.pyx", line 3676, in 
pyarrow._dataset.Scanner.from_dataset
     File "pyarrow/_dataset.pyx", line 3589, in 
pyarrow._dataset.Scanner._make_scan_options
     File "pyarrow/_dataset.pyx", line 3519, in 
pyarrow._dataset._populate_builder
     File "pyarrow/_compute.pyx", line 2864, in pyarrow._compute._bind
     File "pyarrow/error.pxi", line 155, in 
pyarrow.lib.pyarrow_internal_check_status
     File "pyarrow/error.pxi", line 92, in pyarrow.lib.check_status
   pyarrow.lib.ArrowInvalid: No match for FieldRef.Name(year_plus_one) in year: 
int64
   ```
   
   
   ### Component(s)
   
   Python


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@arrow.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

[I] How to project and filter a Dataset simultaneously (out-of-core)? [arrow]

Reply via email to