[I] pyarrow.dataset.dataset exclude_invalid_files parameter does not adhere to documented default value [arrow]

via GitHub Fri, 18 Apr 2025 23:07:55 -0700


chwiese opened a new issue, #46181:
URL: https://github.com/apache/arrow/issues/46181


   ### Describe the bug, including details regarding any error messages, version, and platform.
   
   The PyArrow documentation suggests that the `exclude_invalid_files`
   parameter of `dataset()` defaults to True, but in practice it appears to
   default to False. As a result, the function fails when it encounters an
   invalid Parquet file instead of skipping it.
   
   Here is a script to reproduce the issue, courtesy of an AI assistant:
   ```
   """
   PyArrow Dataset Bug Report: exclude_invalid_files parameter default
   
   Issue: PyArrow documentation states that exclude_invalid_files defaults to
   True, but testing shows it behaves as if the default is False.
   """
   
   import os
   import shutil
   import pyarrow as pa
   import pyarrow.dataset as ds
   import pyarrow.parquet as pq
   
   # Setup test directory
   test_dir = "/tmp/pyarrow_test"
   os.makedirs(test_dir, exist_ok=True)
   
   # Create a valid Parquet file
   valid_data = pa.table({"a": [1, 2, 3]})
   pq.write_table(valid_data, f"{test_dir}/valid.parquet")
   
   # Create an invalid "Parquet" file
   with open(f"{test_dir}/invalid.parquet", "w") as f:
       f.write("This is not a valid Parquet file")
   
   print("Test setup complete. Testing dataset() with invalid files...")
   
   # Test 1: Without specifying exclude_invalid_files (should use default)
   try:
       print("\nTEST 1: Using default parameter (documentation says default=True)")
       dataset = ds.dataset(test_dir, format="parquet")
       print("✓ Success! Dataset created with invalid files ignored")
       print(f"Files found: {len(dataset.files)}")
   except Exception as e:
       print(f"✗ Failed! Error: {e}")
       print("This indicates exclude_invalid_files actually defaults to False")
   
   # Test 2: Explicitly set exclude_invalid_files=True
   try:
       print("\nTEST 2: With exclude_invalid_files=True")
       dataset = ds.dataset(test_dir, format="parquet", exclude_invalid_files=True)
       print("✓ Success! Dataset created with invalid files ignored")
       print(f"Files found: {len(dataset.files)}")
   except Exception as e:
       print(f"✗ Failed! Error: {e}")
   
   # Test 3: Explicitly set exclude_invalid_files=False
   try:
       print("\nTEST 3: With exclude_invalid_files=False")
       dataset = ds.dataset(test_dir, format="parquet", exclude_invalid_files=False)
       # We don't expect this to succeed, but if it does:
       print("✓ Success! Dataset created despite invalid files")
   except Exception as e:
       print(f"✗ Failed as expected when handling invalid files: {type(e).__name__}")
   
   # Cleanup
   print("\nCleaning up test directory")
   shutil.rmtree(test_dir)
   
   print("""
   Bug Report Conclusion:
   ----------------------
   If Test 1 failed but Test 2 succeeded, this confirms that exclude_invalid_files
   actually defaults to False, contrary to what the documentation suggests.
   
   This is a documentation bug at minimum, and possibly a behavioral bug if the
   intention was for the parameter to default to True.
   """)
   
   ```
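
Until the default is clarified, a possible stopgap that does not depend on `exclude_invalid_files` at all is to pre-filter the directory by the Parquet magic bytes (`PAR1` at the start and end of the file). This is only a rough sketch: the helper names `looks_like_parquet` and `candidate_parquet_files` are made up for illustration, the check assumes files with an unencrypted footer, and it does not prove a file is readable, only that it is not obviously something else.

```python
import os

PARQUET_MAGIC = b"PAR1"  # 4-byte marker at the start and end of a Parquet file


def looks_like_parquet(path):
    """Cheap check: does the file start and end with the Parquet magic bytes?"""
    # Smallest plausible file: magic + 4-byte footer length + magic = 12 bytes
    if os.path.getsize(path) < 12:
        return False
    with open(path, "rb") as f:
        head = f.read(4)
        f.seek(-4, os.SEEK_END)
        tail = f.read(4)
    return head == PARQUET_MAGIC and tail == PARQUET_MAGIC


def candidate_parquet_files(directory):
    """Return sorted paths of regular files in `directory` that pass the check."""
    paths = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if os.path.isfile(path) and looks_like_parquet(path):
            paths.append(path)
    return paths
```

The resulting list could then be passed as the source, e.g. `ds.dataset(candidate_parquet_files(test_dir), format="parquet")`, since `dataset()` also accepts a list of file paths. Explicitly passing `exclude_invalid_files=True` (Test 2 above) is the other obvious workaround.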
   
   ### Component(s)
   
   Python


