jameshowison opened a new issue, #46391: URL: https://github.com/apache/arrow/issues/46391
We are seeing unexpected behavior from arrow when using dplyr `filter` with `<` (less than) on data opened with `open_dataset` (pointing to a single file, not sharded). The same code works fine with `read_parquet`.

We asked about this on Stack Overflow: https://stackoverflow.com/questions/79607580/how-to-properly-use-less-than-in-a-dplyr-filter-of-a-sharded-arrow-dataset#comment140408196_79607580

I've created a test dataset and code at: https://github.com/softcite/softcite-extractions-parquet-analysis

The relevant script is: https://github.com/softcite/softcite-extractions-parquet-analysis/blob/main/analysis/parquet_lessthan_issue.R

I have already tried a discussion. I didn't think the Posit forums would help, since I think the arrow/parquet versions of the dplyr verbs are implemented here?

The only difference I can see is that the column datatypes from `open_dataset` (uint16) differ from those from `read_parquet` (int). But I don't think that should influence how `<` (less than) operates? I don't know how to debug further (down into C++ :)

Can anyone replicate this with the dataset and code at https://github.com/softcite/softcite-extractions-parquet-analysis ?

_Originally posted by @jameshowison in https://github.com/apache/arrow/discussions/46383_
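
For anyone trying to narrow this down, here is a minimal sketch of how the uint16 versus int difference could be isolated from the real data. It is not taken from the linked repository: the column name `n`, the values 1 to 10, and the temporary file are hypothetical stand-ins, and the `cast()` step is only an assumed workaround candidate, not a confirmed fix. Whether a file this small triggers the behavior at all is unknown.

```r
library(arrow)
library(dplyr)

# Hypothetical stand-in data: a single unsigned 16-bit column named `n`.
tbl <- arrow_table(n = Array$create(1:10, type = uint16()))
path <- tempfile(fileext = ".parquet")
write_parquet(tbl, path)

# Path 1: read_parquet() returns a data frame, so `n` arrives as an R integer
# and filter() is evaluated by ordinary dplyr.
via_read <- read_parquet(path) %>%
  filter(n < 5)

# Path 2: open_dataset() keeps `n` as uint16 and filter() is evaluated by Arrow.
ds <- open_dataset(path)
print(ds$schema)
via_dataset <- ds %>%
  filter(n < 5) %>%
  collect()

# Assumed workaround candidate: cast to a signed type before comparing.
via_cast <- ds %>%
  mutate(n = cast(n, int32())) %>%
  filter(n < 5) %>%
  collect()

# Compare row counts from the three paths; a mismatch would point at the
# uint16 comparison rather than at the data itself.
print(c(read = nrow(via_read), dataset = nrow(via_dataset), cast = nrow(via_cast)))
```

If the three counts agree on this toy file but differ on the softcite file, the next thing to vary would be the values or the file's own metadata rather than the column type alone.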