etseidl opened a new issue, #34086:
URL: https://github.com/apache/arrow/issues/34086

   ### Describe the bug, including details regarding any error messages, version, and platform.
   
   When writing Parquet files with version 2 data page headers, the `num_rows`
field in the page header is incorrect: it appears to hold the value count
rather than the row count.  This seems to be because
`ColumnWriterImpl::BuildDataPageV2()` in `column_writer.cc` passes `num_values`
twice to the constructor for `DataPageV2`.  The 4th argument should be
`num_rows`.
   
   To reproduce:
   ```python
   import pyarrow.parquet as pq
   import pyarrow as pa
   table = pa.table({'col0': [[1,2,3]]})
   pq.write_table(table, 'bug.parquet', data_page_version="2.0")
   ```
   Examining the resulting file with parquet-cli:
   ```sh
   % parquet-cli pages bug.parquet
   Column: col0.list.item
   
--------------------------------------------------------------------------------
     page   type  enc  count   avg size   size       rows     nulls   min / max
     0-D    dict  S _  3       8.00 B     24 B      
     0-1    data  _ R  3       2.67 B     8 B        3        0       "1" / "3"
   ```
   "rows" should be 1, since the table has a single row holding one three-element list.
   Rewriting the file with parquet-mr gives:
   ```sh
   % parquet-cli pages bug-mr.parquet
   Column: col0.list.element
   
--------------------------------------------------------------------------------
     page   type  enc  count   avg size   size       rows     nulls   min / max
     0-0    data  _ D  3       5.00 B     15 B       1        0       
   ```
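
To make the difference between the two counts concrete, here is a minimal sketch in plain Python. It does not use the Arrow or parquet-mr APIs; the helper `page_counts` is hypothetical and exists only for illustration. It relies on the Parquet convention that a repetition level of 0 marks the first value of a new top-level row, and it assumes the page begins on a row boundary.

```python
def page_counts(repetition_levels):
    """Return (num_values, num_rows) for one data page of a repeated column.

    Hypothetical helper for illustration only; not part of any Parquet library.
    """
    # Every level entry corresponds to one value slot in the page.
    num_values = len(repetition_levels)
    # A repetition level of 0 marks the first value of a new top-level row.
    num_rows = sum(1 for level in repetition_levels if level == 0)
    return num_values, num_rows


# The table from the reproduction: one row holding the list [1, 2, 3].
levels = [0, 1, 1]
num_values, num_rows = page_counts(levels)
print(num_values, num_rows)  # 3 1
```

For a flat, non-repeated column the two counts coincide, which may be why the problem only shows up with nested data such as the list column used here.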
   
   ### Component(s)
   
   C++, Parquet

