JosephWagner opened a new issue, #46781: URL: https://github.com/apache/arrow/issues/46781
### Describe the usage question you have. Please include as many useful details as possible. Hi all, I'm uncertain how to filter a Dataset.scanner using projected/transformed columns that are not present in the underlying data. I know that I can create a table as an intermediate step, but I am working with data larger than memory so I'd like to keep things out-of-core if possible. What is the idiomatic way to do this? My best attempt so far is to separate the projection and filtering into 2 separate actions. Am I correct in thinking that creating a projected RecordBatchReader and then a scanner from that will result in out-of-core projection and filtering? ``` from pyarrow import compute as pc import pyarrow as pa import pyarrow.parquet as pq import pyarrow.dataset as ds table = pa.table({'year': [2019, 2020, 2021]}) pq.write_table(table, "dataset_scanner.parquet") dataset = ds.dataset("dataset_scanner.parquet") print("filtered dataset year+1> 2020:") print("attempt chained scanning to compute year_plus_one first") projected = dataset.scanner(columns={"year": ds.field("year"), "year_plus_one": pc.add(ds.field("year"), 1)}).to_reader() print(ds.Scanner.from_batches(projected, filter=ds.field("year_plus_one") > 2020).to_table()) ##print("Filtered dataset year+1> 2020:") ##print("fails because year_plus_one does not exist in schema to filter with") ##print(dataset.scanner(columns={"year": ds.field("year"), "year_plus_one": pc.add(ds.field("year"), 1)}, filter=ds.field("year_plus_one") > 2020).to_table()) ``` My first attempt (commented out above) tried to do everything in one operation and it returns this error: ``` File "pyarrow/_dataset.pyx", line 415, in pyarrow._dataset.Dataset.scanner File "pyarrow/_dataset.pyx", line 3676, in pyarrow._dataset.Scanner.from_dataset File "pyarrow/_dataset.pyx", line 3589, in pyarrow._dataset.Scanner._make_scan_options File "pyarrow/_dataset.pyx", line 3519, in pyarrow._dataset._populate_builder File "pyarrow/_compute.pyx", line 2864, in pyarrow._compute._bind File "pyarrow/error.pxi", line 155, in pyarrow.lib.pyarrow_internal_check_status File "pyarrow/error.pxi", line 92, in pyarrow.lib.check_status pyarrow.lib.ArrowInvalid: No match for FieldRef.Name(year_plus_one) in year: int64 ``` ### Component(s) Python -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@arrow.apache.org.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org