[I] [Parquet] support nested column selection for list-struct types when reading parquet files [arrow]

via GitHub Mon, 05 May 2025 16:04:30 -0700


hombit opened a new issue, #46329:
URL: https://github.com/apache/arrow/issues/46329


   ### Describe the enhancement requested
   
   Currently, `pyarrow.parquet.read_table(columns)` supports selection of 
nested columns with dot notation. For example, `columns=["a.b"]` will select 
field "b" from the struct column "a". It would be really nice if this also 
worked for list columns with struct values.
   
   Currently, the following code fails with `ArrowInvalid`:
   
   ```python
   import pyarrow as pa
   import pyarrow.parquet as pq
   
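   # Column "a" has type list<struct<b: int64, c: int64>>: two lists of 4 and 6 structs
   list_struct = pa.ListArray.from_arrays(
       offsets=[0, 4, 10],
       values=[{"b": 1, "c": 1}] * 10,
   )
   pq.write_table(pa.table({"a": list_struct}), "/tmp/table.parquet")
   # Raises ArrowInvalid: "a.b" is not resolved through the list level
   pq.read_table("/tmp/table.parquet", columns=["a.b"])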
   ```
   
   <details><summary></summary>
   
   ```
   File ~/.virtualenvs/v/lib/python3.13/site-packages/pyarrow/parquet/core.py:1824, in read_table(source, columns, use_threads, schema, use_pandas_metadata, read_dictionary, memory_map, buffer_size, partitioning, filesystem, filters, ignore_prefixes, pre_buffer, coerce_int96_timestamp_unit, decryption_properties, thrift_string_size_limit, thrift_container_size_limit, page_checksum_verification)
      1812     # TODO test that source is not a directory or a list
      1813     dataset = ParquetFile(
      1814         source, read_dictionary=read_dictionary,
      1815         memory_map=memory_map, buffer_size=buffer_size,
      (...)
      1821         page_checksum_verification=page_checksum_verification,
      1822     )
   -> 1824 return dataset.read(columns=columns, use_threads=use_threads,
      1825                     use_pandas_metadata=use_pandas_metadata)
   
   File ~/.virtualenvs/v/lib/python3.13/site-packages/pyarrow/parquet/core.py:1475, in ParquetDataset.read(self, columns, use_threads, use_pandas_metadata)
      1467         index_columns = [
      1468             col for col in _get_pandas_index_columns(metadata)
      1469             if not isinstance(col, dict)
      1470         ]
      1471         columns = (
      1472             list(columns) + list(set(index_columns) - set(columns))
      1473         )
   -> 1475 table = self._dataset.to_table(
      1476     columns=columns, filter=self._filter_expression,
      1477     use_threads=use_threads
      1478 )
      1480 # if use_pandas_metadata, restore the pandas metadata (which gets
      1481 # lost if doing a specific `columns` selection in to_table)
      1482 if use_pandas_metadata:
   
   File ~/.virtualenvs/v/lib/python3.13/site-packages/pyarrow/_dataset.pyx:579, in pyarrow._dataset.Dataset.to_table()
   
   File ~/.virtualenvs/v/lib/python3.13/site-packages/pyarrow/_dataset.pyx:415, in pyarrow._dataset.Dataset.scanner()
   
   File ~/.virtualenvs/v/lib/python3.13/site-packages/pyarrow/_dataset.pyx:3676, in pyarrow._dataset.Scanner.from_dataset()
   
   File ~/.virtualenvs/v/lib/python3.13/site-packages/pyarrow/_dataset.pyx:3589, in pyarrow._dataset.Scanner._make_scan_options()
   
   File ~/.virtualenvs/v/lib/python3.13/site-packages/pyarrow/_dataset.pyx:3539, in pyarrow._dataset._populate_builder()
   
   File ~/.virtualenvs/v/lib/python3.13/site-packages/pyarrow/error.pxi:92, in pyarrow.lib.check_status()
   
   ArrowInvalid: No match for FieldRef.Nested(FieldRef.Name(a) FieldRef.Name(b)) in a: list<element: struct<b: int64, c: int64>>
   __fragment_index: int32
   __batch_index: int32
   __last_in_fragment: bool
   __filename: string
   ```
   
   </details>
   
   ### Component(s)
   
   Python


