[I] Unexpected behavior with < on int field using dplyr filter (read_parquet vs open_dataset) [arrow]

via GitHub Sat, 10 May 2025 06:59:41 -0700


jameshowison opened a new issue, #46391:
URL: https://github.com/apache/arrow/issues/46391


   We are seeing unexpected behavior in arrow when using dplyr `filter` with 
`<` (less than) on a dataset opened with `open_dataset` (pointing at a single 
file, not sharded). The same code works fine with `read_parquet`.
   
   We asked about this on Stack Overflow here: 
https://stackoverflow.com/questions/79607580/how-to-properly-use-less-than-in-a-dplyr-filter-of-a-sharded-arrow-dataset#comment140408196_79607580
   
   I've created a test dataset and code at 
https://github.com/softcite/softcite-extractions-parquet-analysis. The script 
that shows the issue is 
https://github.com/softcite/softcite-extractions-parquet-analysis/blob/main/analysis/parquet_lessthan_issue.R
   
   I don't know how to debug this further, so I first tried a discussion. I 
didn't think that the Posit forums would help, since I think the arrow/parquet 
versions of the dplyr verbs are implemented here?
   
   The only difference I can see is that the data types reported by 
`open_dataset` (uint16) differ from those reported by `read_parquet` (int). But 
I don't think that should influence how `<` (less than) operates? I don't know 
how to debug further (down into C++ :).
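
A minimal sketch of how the type difference described above might be checked on a toy file. The column name `year`, its values, and the temporary file are hypothetical and are not taken from the linked dataset; the sketch only assumes that the `arrow` and `dplyr` packages are installed. It does not claim to reproduce the problem, it just runs the same `<` filter through both read paths so the results can be compared.

```r
library(arrow)
library(dplyr)

# Hypothetical toy data: one uint16 column named `year` (not the real dataset).
path <- tempfile(fileext = ".parquet")
toy <- arrow_table(
  year = Array$create(c(1990L, 2005L, 2020L), type = uint16())
)
write_parquet(toy, path)

# read_parquet() returns a data frame by default, so the column
# is converted to an ordinary R integer vector.
df <- read_parquet(path)
print(class(df$year))

# open_dataset() keeps the Arrow type that is stored in the file.
ds <- open_dataset(path)
print(ds$schema)

# Run the same filter through both paths and compare the row counts.
n_df <- df %>% filter(year < 2000L) %>% nrow()
n_ds <- ds %>% filter(year < 2000L) %>% collect() %>% nrow()
print(c(read_parquet = n_df, open_dataset = n_ds))
```

If the two counts differ on the real file, printing the schema as above and including it in the report may help narrow down whether the column type is involved.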
   
   Can anyone replicate this with the dataset and code at 
https://github.com/softcite/softcite-extractions-parquet-analysis?
   
   _Originally posted by @jameshowison in 
https://github.com/apache/arrow/discussions/46383_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@arrow.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
