sidneymau opened a new issue, #47770: URL: https://github.com/apache/arrow/issues/47770
I noticed that the `pyarrow.dataset.dataset` [docstring](https://github.com/apache/arrow/blob/5750e2932fc26c27be92fe9262f6b128a513abca/python/pyarrow/_dataset.pyx#L3244) says that the default for `exclude_invalid_files` is `True`. In the code, the argument actually defaults to `None`, which does _not_ get interpreted as `True`. Here is a short script that demonstrates this:

```python
import pyarrow as pa
import pyarrow.dataset as ds

n_legs = pa.array([2, 4, 5, 100])
animals = pa.array(["Flamingo", "Horse", "Brittle stars", "Centipede"])
names = ["n_legs", "animals"]
table = pa.Table.from_arrays([n_legs, animals], names=names)

ds.write_dataset(
    table,
    'dataset/parquet',
    format='parquet',
)
ds.write_dataset(
    table,
    'dataset/csv',
    format='csv',
)

# first, check that exclude_invalid_files works:
parquet_dataset = ds.dataset("dataset", format="parquet", exclude_invalid_files=True)
csv_dataset = ds.dataset("dataset", format="csv", exclude_invalid_files=True)
assert parquet_dataset.to_table() == csv_dataset.to_table()

# next, check that invalid files raise the appropriate error
try:
    ds.dataset("dataset", format="parquet", exclude_invalid_files=False)
except pa.ArrowInvalid:
    pass

try:
    ds.dataset("dataset", format="csv", exclude_invalid_files=False)
except pa.ArrowInvalid:
    pass

# finally, check the default behavior -- this will raise pa.ArrowInvalid
ds.dataset("dataset", format="parquet")
```

And the resulting traceback:

```
Traceback (most recent call last):
  File "/home/smau/Documents/scratch/pa_bug/test.py", line 40, in <module>
    ds.dataset("dataset", format="parquet")
    ~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/smau/.conda/envs/roman-pipeline/lib/python3.13/site-packages/pyarrow/dataset.py", line 790, in dataset
    return _filesystem_dataset(source, **kwargs)
  File "/home/smau/.conda/envs/roman-pipeline/lib/python3.13/site-packages/pyarrow/dataset.py", line 482, in _filesystem_dataset
    return factory.finish(schema)
           ~~~~~~~~~~~~~~^^^^^^^^
  File "pyarrow/_dataset.pyx", line 3196, in pyarrow._dataset.DatasetFactory.finish
  File "pyarrow/error.pxi", line 155, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 92, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Error creating dataset. Could not read schema from 'dataset/csv/part-0.csv'. Is this a 'parquet' file?: Could not open Parquet input source 'dataset/csv/part-0.csv': Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.
```

I think the docstring just needs to be updated to say that the default is `False`.
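To make the mismatch concrete, here is a minimal, self-contained sketch of the pattern at fault (a hypothetical stand-in, not pyarrow's actual implementation): a keyword whose Python-level default is `None` falls through to the underlying factory's `False` default, even though the docstring claims the default is `True`:

```python
def open_dataset(exclude_invalid_files=None):
    """Hypothetical stand-in for pyarrow.dataset.dataset.

    The docstring claims exclude_invalid_files defaults to True, but
    None is passed through to the underlying factory, whose own
    default behaves like False.
    """
    if exclude_invalid_files is None:
        # None is NOT coerced to the documented default of True
        exclude_invalid_files = False
    return exclude_invalid_files

# The effective default disagrees with the documented one:
assert open_dataset() is False                          # documented as True
assert open_dataset(exclude_invalid_files=True) is True  # opt-in still works
```

This is why the repro above only raises `pa.ArrowInvalid` when `exclude_invalid_files` is omitted or set to `False`.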
