jameshowison opened a new issue, #46391:
URL: https://github.com/apache/arrow/issues/46391

   We are seeing unexpected behavior in arrow when using dplyr `filter` with `<` 
(less than) on a dataset opened with `open_dataset` (pointing at a single file, 
not sharded). The same code works fine with `read_parquet`.
   
   We asked about this on Stack Overflow here: 
https://stackoverflow.com/questions/79607580/how-to-properly-use-less-than-in-a-dplyr-filter-of-a-sharded-arrow-dataset#comment140408196_79607580
   
   I've created a test dataset and code in this repository: 
https://github.com/softcite/softcite-extractions-parquet-analysis
   The relevant script is: 
https://github.com/softcite/softcite-extractions-parquet-analysis/blob/main/analysis/parquet_lessthan_issue.R
   
   I don't know how to debug this further, so I tried opening a discussion. I 
didn't think the Posit forums would help, since I think the arrow/parquet 
versions of the dplyr verbs are implemented here?
   
   The only difference I can see is in the data types: `open_dataset` reports 
uint16, while `read_parquet` reports int. But I don't think that should 
influence how `<` (less than) operates? I don't know how to debug further 
(down into C++ :)
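   
   A minimal sketch of the comparison described above, for anyone who wants to try it without cloning the repository. The temporary file, the column name `n`, and the values in it are hypothetical stand-ins and are not taken from the real dataset, so this may or may not trigger the same behavior; it only shows the two code paths side by side on a uint16 column.

```r
library(arrow)
library(dplyr)

# Hypothetical stand-in data: one uint16 column named `n`.
# The real dataset and its column names live in the linked repository.
path <- tempfile(fileext = ".parquet")
tbl <- arrow_table(n = Array$create(c(1L, 4L, 5L, 10L, 300L), type = uint16()))
write_parquet(tbl, path)

# Path 1: read the whole file into an R data frame, then filter in R.
via_read <- read_parquet(path) %>%
  filter(n < 5)

# Path 2: open the same file lazily as a dataset, so Arrow evaluates the filter.
via_dataset <- open_dataset(path) %>%
  filter(n < 5) %>%
  collect()

# Compare the type each path reports and the rows each one returns.
print(open_dataset(path)$schema)
print(class(read_parquet(path)$n))
print(via_read)
print(via_dataset)
```

   If the two printed results differ, that would match what we see with the real file; if they agree, the problem probably depends on something specific to the real data.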
   
   Can anyone replicate this with the dataset and code at 
https://github.com/softcite/softcite-extractions-parquet-analysis
   
   _Originally posted by @jameshowison in 
https://github.com/apache/arrow/discussions/46383_


