j0bekt01 opened a new issue, #41779:
URL: https://github.com/apache/arrow/issues/41779

   ### Describe the bug, including details regarding any error messages, version, and platform.
   
   I'm trying to read Parquet files from S3 that have a Hive partition layout 
'/year=YYYY/month=MM/day=DD/hour=HH/' using the .read() method, but it fails, 
stating that one of the partition columns doesn't exist. However, if I exclude 
the partition columns and pass read() only the columns that are actually present 
in the files (via the `columns` argument), it reads without any issues. According 
to the documentation, the read() method should ignore Hive partition columns.
   
   ```python
   import datetime

   import polars as pl
   import pyarrow.parquet as pq
   import s3fs

   dt = datetime.datetime(2024, 5, 17)
   # `bucket` is defined elsewhere
   path = f"{bucket}/folder-to-files/year={dt.year}/month={dt.month:02d}/"
   dataset = pq.ParquetDataset(path, partitioning='hive', filesystem=s3fs.S3FileSystem())

   # This fails, complaining that one of the partition columns does not exist
   (
       pl.LazyFrame(dataset.read())
         .select(pl.all())
         .head(100)
         .collect()
   )

   # Workaround: read only the columns that are physically present in the files
   cols = [c for c in dataset.schema.names if c not in ('year', 'month', 'day', 'hour')]

   # This works
   (
       pl.LazyFrame(dataset.read(columns=cols))
         .select(pl.all())
         .head(100)
         .collect()
   )
   ```
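   In case it helps, here is a minimal local sketch of the same two read patterns 
(assuming the behavior is not S3-specific; the data, paths, and partition columns 
below are made up):

   ```python
   import pyarrow as pa
   import pyarrow.parquet as pq

   # Write a tiny hive-partitioned dataset to a local directory (hypothetical data).
   table = pa.table({
       "value": [1, 2, 3],
       "year": [2024, 2024, 2024],
       "month": [5, 5, 5],
   })
   pq.write_to_dataset(table, "repro_dataset", partition_cols=["year", "month"])

   dataset = pq.ParquetDataset("repro_dataset", partitioning="hive")

   # Read everything, including the hive partition columns.
   print(dataset.read().schema)

   # Read only the columns physically stored in the files.
   cols = [c for c in dataset.schema.names if c not in ("year", "month")]
   print(dataset.read(columns=cols).schema)
   ```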
   Windows 11
   Python 3.10
   pyarrow 16.1.0
   
   ### Component(s)
   
   Parquet, Python

