chwiese opened a new issue, #46181:
URL: https://github.com/apache/arrow/issues/46181
### Describe the bug, including details regarding any error messages, version, and platform.
The PyArrow documentation suggests that the exclude_invalid_files parameter
defaults to True for the dataset() function, but in practice, it appears to
default to False. This causes the function to fail when encountering invalid
Parquet files instead of skipping them.
Here is a script to reproduce the issue, courtesy of an AI assistant:
```
"""
PyArrow Dataset Bug Report: exclude_invalid_files parameter default
Issue: PyArrow documentation states that exclude_invalid_files defaults to True,
but testing shows it behaves as if the default is False.
"""
import os
import shutil
import pyarrow as pa
import pyarrow.dataset as ds
import pyarrow.parquet as pq
# Setup test directory
test_dir = "/tmp/pyarrow_test"
os.makedirs(test_dir, exist_ok=True)
# Create a valid Parquet file
valid_data = pa.table({"a": [1, 2, 3]})
pq.write_table(valid_data, f"{test_dir}/valid.parquet")
# Create an invalid "Parquet" file
with open(f"{test_dir}/invalid.parquet", "w") as f:
    f.write("This is not a valid Parquet file")
print("Test setup complete. Testing dataset() with invalid files...")
# Test 1: Without specifying exclude_invalid_files (should use default)
try:
    print("\nTEST 1: Using default parameter (documentation says default=True)")
    dataset = ds.dataset(test_dir, format="parquet")
    print("✓ Success! Dataset created with invalid files ignored")
    print(f"Files found: {len(dataset.files)}")
except Exception as e:
    print(f"✗ Failed! Error: {e}")
    print("This indicates exclude_invalid_files actually defaults to False")
# Test 2: Explicitly set exclude_invalid_files=True
try:
    print("\nTEST 2: With exclude_invalid_files=True")
    dataset = ds.dataset(test_dir, format="parquet",
                         exclude_invalid_files=True)
    print("✓ Success! Dataset created with invalid files ignored")
    print(f"Files found: {len(dataset.files)}")
except Exception as e:
    print(f"✗ Failed! Error: {e}")
# Test 3: Explicitly set exclude_invalid_files=False
try:
    print("\nTEST 3: With exclude_invalid_files=False")
    dataset = ds.dataset(test_dir, format="parquet",
                         exclude_invalid_files=False)
    # We don't expect this to succeed, but if it does:
    print("✓ Success! Dataset created despite invalid files")
except Exception as e:
    print(f"✗ Failed as expected when handling invalid files: {type(e).__name__}")
# Cleanup
print("\nCleaning up test directory")
shutil.rmtree(test_dir)
print("""
Bug Report Conclusion:
----------------------
If Test 1 failed but Test 2 succeeded, this confirms that exclude_invalid_files
actually defaults to False, contrary to what the documentation suggests.
This is a documentation bug at minimum, and possibly a behavioral bug if the
intention was for the parameter to default to True.
""")
```
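For context on why `invalid.parquet` in the script above counts as invalid, here is a minimal stdlib-only sketch. It relies on the fact that a Parquet file with a plaintext footer starts and ends with the 4-byte magic `PAR1`. The helper name `looks_like_parquet` is hypothetical and not part of PyArrow. This is not a description of how PyArrow validates files, and it does not parse the footer.

```python
import os
import tempfile

# 4-byte marker at the start and end of a Parquet file with a plaintext footer.
# (Files with an encrypted footer use a different marker and would fail this check.)
PARQUET_MAGIC = b"PAR1"


def looks_like_parquet(path):
    """Cheap check: True if the file starts and ends with the Parquet magic bytes.

    Hypothetical helper for illustration only; it does not validate the footer.
    """
    # Smallest possible layout: magic (4) + footer length (4) + magic (4).
    if os.path.getsize(path) < 12:
        return False
    with open(path, "rb") as f:
        head = f.read(4)
        f.seek(-4, os.SEEK_END)
        tail = f.read(4)
    return head == PARQUET_MAGIC and tail == PARQUET_MAGIC


# Recreate the invalid file from the reproduction script in a temp directory.
with tempfile.TemporaryDirectory() as tmp:
    bogus = os.path.join(tmp, "invalid.parquet")
    with open(bogus, "w") as f:
        f.write("This is not a valid Parquet file")
    print(looks_like_parquet(bogus))  # prints False
```

Paths that pass such a check could be handed to `ds.dataset()` as an explicit list, which avoids depending on the `exclude_invalid_files` default. Passing `exclude_invalid_files=True` explicitly, as in Test 2, is the more direct route.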
### Component(s)
Python