[GitHub] [arrow] IamJeffG opened a new issue, #37944: HivePartitioning.discover() returning unexpected columns when reading a non-partitioned dataset

via GitHub Thu, 28 Sep 2023 14:44:10 -0700


IamJeffG opened a new issue, #37944:
URL: https://github.com/apache/arrow/issues/37944


   ### Describe the bug, including details regarding any error messages, version, and platform.
   
   I get a lot of use out of this pattern, which automatically discovers and 
parses Hive partitions from a directory structure:
   
       dataset = ds.dataset(path, partitioning=ds.HivePartitioning.discover())
       dataset.partitioning.schema  # lists the directory-based partitions
   
   It seems to work great when the dataset being read is, in fact, 
Hive-Partitioned. However, it gives an unexpected result when I run this code 
over a non-partitioned dataset. 
   
   **Minimal reproducible example:**
   
   ```py
   import os
   import pandas as pd
   import pyarrow.dataset as ds
   
   # Create a dataset. One file in one folder.
   root = "/tmp/example_dataset/"
   os.makedirs(root, exist_ok=True)
   df = pd.DataFrame(
       [['alice', 40], ['bob', 22], ['carlos', 50]], columns=['Name', 'Age']
   )
   df.to_csv(os.path.join(root, "part-0.csv"), index=False)
   
   dataset = ds.dataset(
       root, format="csv", partitioning=ds.HivePartitioning.discover()
   )
   # A non-partitioned dataset should report no partition fields.
   if dataset.partitioning.schema.names != []:
       raise AssertionError(
           f"Read unexpected HivePartitioning {dataset.partitioning.schema.names}"
       )
   ```
   
   Note that this also reproduces with non-partitioned Parquet datasets 
(`format="parquet"`), not only CSVs.
   
   **Expected behavior:** The example dataset does not use hive-partitioning, 
so I expect `dataset.partitioning.schema.names` to be the empty list.
   
   **Actual behavior:** Instead, it lists every column found inside the single 
CSV fragment:
   
       AssertionError: Read unexpected HivePartitioning ['Name', 'Age']
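
To make the expectation concrete, here is a minimal sketch of the discovery result I would expect. It assumes Hive partition keys come only from `key=value` directory segments below the dataset root. The helper `hive_keys_from_paths` and the partitioned paths are hypothetical and purely illustrative; this is not pyarrow's API or its actual discovery logic.

```python
from pathlib import PurePosixPath


def hive_keys_from_paths(root, paths):
    """Return Hive partition keys found in directory segments below `root`.

    Hypothetical helper (not part of pyarrow): only directory names of the
    form `key=value` count; file names and file contents are ignored.
    """
    root = PurePosixPath(root)
    keys = []
    for path in paths:
        relative = PurePosixPath(path).relative_to(root)
        # relative.parts[:-1] are the directories; the last part is the file.
        for segment in relative.parts[:-1]:
            key, sep, _value = segment.partition("=")
            if sep and key and key not in keys:
                keys.append(key)
    return keys


# Non-partitioned layout, as in the example above: no keys expected.
flat = hive_keys_from_paths(
    "/tmp/example_dataset/", ["/tmp/example_dataset/part-0.csv"]
)

# Hive-partitioned layout (made-up paths for illustration).
nested = hive_keys_from_paths(
    "/tmp/example_dataset/",
    [
        "/tmp/example_dataset/year=2022/month=1/part-0.csv",
        "/tmp/example_dataset/year=2023/month=2/part-0.csv",
    ],
)
print(flat, nested)
```

Under that assumption, the flat layout yields no partition keys, while the nested layout yields `year` and `month`. The column names inside the files never enter into it.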
   
   
   
   ### Component(s)
   
   Python


