sidneymau opened a new issue, #47770:
URL: https://github.com/apache/arrow/issues/47770

   I noticed that the `pyarrow.dataset.dataset` [docstring](https://github.com/apache/arrow/blob/5750e2932fc26c27be92fe9262f6b128a513abca/python/pyarrow/_dataset.pyx#L3244) says that the default for `exclude_invalid_files` is `True`. In the code, however, the argument actually defaults to `None`, which does _not_ get interpreted as `True`.
   
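   For what it's worth, the low-level factory options appear to default to `False`. Here is a quick way to check (assuming, from my reading of `_dataset.pyx`, that `FileSystemFactoryOptions` is the options object that `ds.dataset` ultimately populates):
   ```
   import pyarrow.dataset as ds

   # As far as I can tell, this is the options object that ds.dataset()
   # fills in; passing exclude_invalid_files=None leaves the underlying
   # default in place.
   opts = ds.FileSystemFactoryOptions()
   print(opts.exclude_invalid_files)  # prints False, not True
   ```
   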
   Here's some very quick code to demonstrate this:
   ```
   import pyarrow as pa
   import pyarrow.dataset as ds
   
   n_legs = pa.array([2, 4, 5, 100])
   animals = pa.array(["Flamingo", "Horse", "Brittle stars", "Centipede"])
   names = ["n_legs", "animals"]
   
   table = pa.Table.from_arrays([n_legs, animals], names=names)
   
   ds.write_dataset(
       table,
       'dataset/parquet',
       format='parquet',
   )
   
   ds.write_dataset(
       table,
       'dataset/csv',
       format='csv',
   )
   
   # first, check that exclude_invalid_files works:
   parquet_dataset = ds.dataset("dataset", format="parquet", exclude_invalid_files=True)
   csv_dataset = ds.dataset("dataset", format="csv", exclude_invalid_files=True)
   
   assert parquet_dataset.to_table() == csv_dataset.to_table()
   
   # next, check that invalid files raise the appropriate error
   # (the `else` branches make sure the checks don't pass silently)
   try:
       ds.dataset("dataset", format="parquet", exclude_invalid_files=False)
   except pa.ArrowInvalid:
       pass
   else:
       raise AssertionError("expected pa.ArrowInvalid from the CSV file")
   
   try:
       ds.dataset("dataset", format="csv", exclude_invalid_files=False)
   except pa.ArrowInvalid:
       pass
   else:
       raise AssertionError("expected pa.ArrowInvalid from the Parquet file")
   
   # finally, check the default behavior -- this will raise pa.ArrowInvalid
   ds.dataset("dataset", format="parquet")
   
   ```
   And the resulting traceback:
   ```
   Traceback (most recent call last):
     File "/home/smau/Documents/scratch/pa_bug/test.py", line 40, in <module>
       ds.dataset("dataset", format="parquet")
       ~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
     File "/home/smau/.conda/envs/roman-pipeline/lib/python3.13/site-packages/pyarrow/dataset.py", line 790, in dataset
       return _filesystem_dataset(source, **kwargs)
     File "/home/smau/.conda/envs/roman-pipeline/lib/python3.13/site-packages/pyarrow/dataset.py", line 482, in _filesystem_dataset
       return factory.finish(schema)
              ~~~~~~~~~~~~~~^^^^^^^^
     File "pyarrow/_dataset.pyx", line 3196, in pyarrow._dataset.DatasetFactory.finish
     File "pyarrow/error.pxi", line 155, in pyarrow.lib.pyarrow_internal_check_status
     File "pyarrow/error.pxi", line 92, in pyarrow.lib.check_status
   pyarrow.lib.ArrowInvalid: Error creating dataset. Could not read schema from 'dataset/csv/part-0.csv'. Is this a 'parquet' file?: Could not open Parquet input source 'dataset/csv/part-0.csv': Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.
   ```
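   
   As an aside, one way to see the exclusion actually working is to inspect which files each dataset picked up via the `files` attribute of the resulting `FileSystemDataset` (the paths below are illustrative):
   ```
   import pyarrow.dataset as ds

   # reuses the 'dataset/' tree written by the script above
   parquet_dataset = ds.dataset("dataset", format="parquet", exclude_invalid_files=True)

   # only the valid parquet file should survive the check, e.g.
   # ['dataset/parquet/part-0.parquet']; the CSV file is skipped
   print(parquet_dataset.files)
   ```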
   
   I think the docstring just needs to be updated to say that the default is `False`.
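   
   Assuming the parameter is documented in the usual numpydoc style (I'm paraphrasing the current line from memory, so treat the exact wording as illustrative), the fix would be roughly:
   ```
   - exclude_invalid_files : bool, optional (default True)
   + exclude_invalid_files : bool, optional (default False)
   ```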

