pitrou opened a new issue, #47030:
URL: https://github.com/apache/arrow/issues/47030

   ### Describe the enhancement requested
   
   Both the Rust and Java implementations limit the number of rows written per page:
   * Rust: https://github.com/apache/arrow-rs/blob/3126dad0348035bc5fadc8ec61b7150b9ce6aad5/parquet/src/file/properties.rs#L42
   * Java: https://github.com/apache/parquet-java/blob/4aa2ea91863274aebb1eded243ce275912c16010/parquet-column/src/main/java/org/apache/parquet/column/ParquetProperties.java#L61
   
   They apply this row limit in addition to trying to keep the page size under 1 MB. In practice, the row limit keeps the actual page size well below that threshold.
   
   Parquet C++, however, only has the 1 MB page size limit and does not limit the number of rows written per page. This can result in much larger pages than those produced by the other implementations.
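
To illustrate the difference, here is a minimal sketch of a page-splitting policy that closes a page as soon as either a byte limit or a row limit is reached. It is not the Parquet C++ writer: `PageLimits` and `SplitIntoPages` are hypothetical names, the 20,000-row cap is an assumed value chosen for illustration, and the model ignores encoding, compression and batched size checks.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical limits, for illustration only.
struct PageLimits {
  int64_t max_bytes = 1024 * 1024;  // 1 MB page size target mentioned in the issue
  int64_t max_rows = 20000;         // assumed row cap, not a value stated in the issue
};

// Hypothetical helper: returns the number of rows in each page when
// fixed-width rows are appended one at a time and the current page is
// closed as soon as either limit is reached.
std::vector<int64_t> SplitIntoPages(int64_t total_rows, int64_t bytes_per_row,
                                    const PageLimits& limits) {
  assert(bytes_per_row > 0 && limits.max_bytes > 0 && limits.max_rows > 0);
  std::vector<int64_t> pages;
  int64_t rows_in_page = 0;
  int64_t bytes_in_page = 0;
  for (int64_t i = 0; i < total_rows; ++i) {
    ++rows_in_page;
    bytes_in_page += bytes_per_row;
    // Close the page when the size limit OR the row limit is hit.
    if (bytes_in_page >= limits.max_bytes || rows_in_page >= limits.max_rows) {
      pages.push_back(rows_in_page);
      rows_in_page = 0;
      bytes_in_page = 0;
    }
  }
  // Emit the final, partially filled page if any rows remain.
  if (rows_in_page > 0) pages.push_back(rows_in_page);
  return pages;
}
```

Under these assumptions, writing 1,000,000 rows of 4-byte values with only the size limit (setting `max_rows` to `INT64_MAX`) yields 4 pages of up to 262,144 rows each, while enabling the row cap yields 50 pages of 20,000 rows, or about 80 KB of values per page.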
   
   Large pages can cause several problems:
   1) lower CPU cache efficiency when reading, decompressing, etc.
   2) coarser-grained page pruning with predicate pushdown
   3) larger intermediate buffers, leading to a significant [increase in memory consumption](https://github.com/apache/arrow/issues/46971) when many columns are read
   
   
   ### Component(s)
   
   C++


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

