chwiese opened a new issue, #46181: URL: https://github.com/apache/arrow/issues/46181
### Describe the bug, including details regarding any error messages, version, and platform.

The PyArrow documentation suggests that the exclude_invalid_files parameter defaults to True for the dataset() function, but in practice, it appears to default to False. This causes the function to fail when encountering invalid Parquet files instead of skipping them.

Here is a script to reproduce the issue, courtesy of an AI assistant:

```
"""
PyArrow Dataset Bug Report: exclude_invalid_files parameter default

Issue: PyArrow documentation states that exclude_invalid_files defaults to True,
but testing shows it behaves as if the default is False.
"""
import os
import shutil
import pyarrow as pa
import pyarrow.dataset as ds
import pyarrow.parquet as pq

# Setup test directory
test_dir = "/tmp/pyarrow_test"
os.makedirs(test_dir, exist_ok=True)

# Create a valid Parquet file
valid_data = pa.table({"a": [1, 2, 3]})
pq.write_table(valid_data, f"{test_dir}/valid.parquet")

# Create an invalid "Parquet" file
with open(f"{test_dir}/invalid.parquet", "w") as f:
    f.write("This is not a valid Parquet file")

print("Test setup complete. Testing dataset() with invalid files...")

# Test 1: Without specifying exclude_invalid_files (should use default)
try:
    print("\nTEST 1: Using default parameter (documentation says default=True)")
    dataset = ds.dataset(test_dir, format="parquet")
    print("✓ Success! Dataset created with invalid files ignored")
    print(f"Files found: {len(dataset.files)}")
except Exception as e:
    print(f"✗ Failed! Error: {e}")
    print("This indicates exclude_invalid_files actually defaults to False")

# Test 2: Explicitly set exclude_invalid_files=True
try:
    print("\nTEST 2: With exclude_invalid_files=True")
    dataset = ds.dataset(test_dir, format="parquet", exclude_invalid_files=True)
    print("✓ Success! Dataset created with invalid files ignored")
    print(f"Files found: {len(dataset.files)}")
except Exception as e:
    print(f"✗ Failed! Error: {e}")

# Test 3: Explicitly set exclude_invalid_files=False
try:
    print("\nTEST 3: With exclude_invalid_files=False")
    dataset = ds.dataset(test_dir, format="parquet", exclude_invalid_files=False)
    # We don't expect this to succeed, but if it does:
    print("✓ Success! Dataset created despite invalid files")
except Exception as e:
    print(f"✗ Failed as expected when handling invalid files: {type(e).__name__}")

# Cleanup
print("\nCleaning up test directory")
shutil.rmtree(test_dir)

print("""
Bug Report Conclusion:
----------------------
If Test 1 failed but Test 2 succeeded, this confirms that exclude_invalid_files
actually defaults to False, contrary to what the documentation suggests.

This is a documentation bug at minimum, and possibly a behavioral bug if the
intention was for the parameter to default to True.
""")
```

### Component(s)

Python

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@arrow.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
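
Independent of what the default turns out to be, a possible stopgap is to filter the directory before handing it to dataset(). The sketch below is an assumption-laden illustration, not part of the original report or of PyArrow: the helper names `looks_like_parquet` and `candidate_parquet_files` are hypothetical, and the check only looks for the plain `PAR1` magic bytes at the start and end of each file, so it does not prove a file is fully readable.

```python
import os

# Plain Parquet files begin and end with these 4 magic bytes.
PARQUET_MAGIC = b"PAR1"


def looks_like_parquet(path):
    """Hypothetical helper: cheap check that a file carries the PAR1 magic.

    This does not parse the footer, so a truncated or corrupted file that
    still has both magic markers would pass.
    """
    try:
        # Smallest possible layout: magic (4) + footer length (4) + magic (4),
        # plus a non-empty footer, so anything under 12 bytes cannot qualify.
        if os.path.getsize(path) < 12:
            return False
        with open(path, "rb") as f:
            head = f.read(4)
            f.seek(-4, os.SEEK_END)
            tail = f.read(4)
    except OSError:
        # Unreadable entries (including directories) are treated as invalid.
        return False
    return head == PARQUET_MAGIC and tail == PARQUET_MAGIC


def candidate_parquet_files(directory):
    """Hypothetical helper: sorted paths in `directory` that pass the check."""
    return sorted(
        os.path.join(directory, name)
        for name in os.listdir(directory)
        if looks_like_parquet(os.path.join(directory, name))
    )
```

With the repro script's layout, `candidate_parquet_files(test_dir)` should return only the path to `valid.parquet`, and that list could then be passed as the source, for example `ds.dataset(candidate_parquet_files(test_dir), format="parquet")`, since dataset() also accepts a list of file paths.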