dberenbaum opened a new issue, #43497:
URL: https://github.com/apache/arrow/issues/43497

   ### Describe the bug, including details regarding any error messages, version, and platform.
   
   Take the following example using a publicly available dataset:
   
   ```python
   import gcsfs
   from pyarrow.dataset import dataset
   
   # without fsspec filesystem, get segmentation fault
   fs = None 
   # with fsspec filesystem, hangs and never finishes
   # fs = gcsfs.GCSFileSystem()
   
   uri = "gs://datachain-demo/laion-aesthetics-csv/laion_aesthetics_1024_33M_1.csv"
   ds = dataset(uri, format="csv", filesystem=fs)
   print(ds.head(5))
   ```
   
   As noted in the code comments, the outcome depends on which filesystem is passed: with `fs = None` the call hits a segmentation fault, and with the fsspec `gcsfs` filesystem it hangs indefinitely. Strangely, s3 paths work (neither hang nor fail) with the pyarrow filesystem but hang with the fsspec s3fs filesystem.
   
   Other findings:
   - Similar operations like `ds.take()` and `next(ds.to_batches())` have the same behavior as `ds.head()`
   - `ds.head(use_threads=False)` completes successfully with any filesystem but takes much longer
   
   ### Component(s)
   
   Python


