IamJeffG opened a new issue, #37944:
URL: https://github.com/apache/arrow/issues/37944
### Describe the bug, including details regarding any error messages, version, and platform.
I get a lot of good use out of this pattern to automatically read & parse
HivePartitions out of a directory structure:
dataset = ds.dataset(path, partitioning=ds.HivePartitioning.discover())
dataset.partitioning.schema # lists the directory-based partitions
It seems to work great when the dataset being read is, in fact,
Hive-Partitioned. However, it gives an unexpected result when I run this code
over a non-partitioned dataset.
**Minimum reproducible example:**
```py
import os
import pandas as pd
import pyarrow.dataset as ds
# Create a dataset: one CSV file in one folder, with no partition directories.
root = "/tmp/example_dataset/"
os.makedirs(root, exist_ok=True)
df = pd.DataFrame([['alice', 40], ['bob', 22], ['carlos', 50]],
                  columns=['Name', 'Age'])
df.to_csv(os.path.join(root, "part-0.csv"), index=False)

# Ask pyarrow to discover Hive partitions from the directory structure.
dataset = ds.dataset(root, format="csv",
                     partitioning=ds.HivePartitioning.discover())

# No key=value directories exist, so no partition fields are expected.
if dataset.partitioning.schema.names != []:
    raise AssertionError(
        f"Read unexpected HivePartitioning {dataset.partitioning.schema.names}")
```
Note that this also reproduces with non-partitioned Parquet datasets
(`format="parquet"`), not only CSVs.
**Expected behavior:** The example dataset does not use hive-partitioning,
so I expect `dataset.partitioning.schema.names` to be the empty list.
**Actual behavior:** Instead, it lists all the columns of the single CSV
fragment:
AssertionError: Read unexpected HivePartitioning ['Name', 'Age']
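
Until this is resolved, one possible workaround is to derive the partition keys from the file paths rather than from `dataset.partitioning.schema`. The sketch below is only an illustration: `hive_partition_keys` is a hypothetical helper (not part of pyarrow), and it assumes forward-slash paths and plain `key=value` directory names without URL-encoding.

```py
import posixpath


def hive_partition_keys(file_paths, root):
    """Return the partition key names found in key=value directories under root.

    Hypothetical helper, not a pyarrow API. Keys are returned in the order
    they are first seen; the file name itself is never inspected.
    """
    keys = []
    for path in file_paths:
        # Directory part of the path, relative to the dataset root.
        rel_dir = posixpath.dirname(posixpath.relpath(path, root))
        for segment in rel_dir.split("/"):
            key, sep, _value = segment.partition("=")
            # Only segments of the form key=value count as partitions.
            if sep and key and key not in keys:
                keys.append(key)
    return keys
```

With the dataset from the example above, `hive_partition_keys(dataset.files, root)` should give an empty list, assuming `dataset.files` returns the fragment paths as it does for file-system datasets.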
### Component(s)
Python