shachibista opened a new issue, #374:
URL: https://github.com/apache/arrow-go/issues/374

   ### Describe the usage question you have. Please include as many useful details as possible.
   
   
   I am trying to set row groups based on size instead of number of rows but I 
cannot really figure out how.
   
   The [Parquet documentation 
recommends](https://parquet.apache.org/docs/file-format/configurations/) 
row groups of 512MB-1GB. Setting this through the number of rows is 
unintuitive, since records vary in size and a row count would need some 
heuristic to estimate it.
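
To make the heuristic concrete, here is a minimal sketch of how a row count could be estimated from a sample of observed row sizes. The function name `rowsPerGroup` and the sample values are hypothetical, and the estimate ignores encoding and compression, so it is only a rough starting point.

```go
package main

import "fmt"

// rowsPerGroup estimates how many rows fit into targetBytes, based on the
// average size of the sampled rows. It is a hypothetical helper for this
// sketch and returns 0 when the sample is empty or has no bytes.
func rowsPerGroup(sampleRowBytes []int64, targetBytes int64) int64 {
	var total int64
	for _, b := range sampleRowBytes {
		total += b
	}
	if total <= 0 {
		return 0
	}
	// targetBytes / (total / n), rearranged to avoid truncating the average.
	return targetBytes * int64(len(sampleRowBytes)) / total
}

func main() {
	sample := []int64{100, 200, 300} // made-up row sizes in bytes
	fmt.Println(rowsPerGroup(sample, 512<<20)) // prints 2684354
}
```

The result would then be used as a row-count limit (for example via `parquet.WithMaxRowGroupLength`, if that option is used), which is exactly the indirection the question is trying to avoid.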
   
   Is there a way to specify row groups based on size? I see that there is a 
[comment on 
`WriteBuffered()`](https://github.com/apache/arrow-go/blob/ec15aba303a09c81530d1741f69ba4e45d1e9cf7/parquet/pqarrow/file_writer.go#L157):
   
   ```
   // Additionally, it allows to manually break your row group by
   // checking RowGroupTotalBytesWritten and calling NewBufferedRowGroup,
   // while Write will always create at least 1 row group for the record.
   ```
   
   However, I cannot get it to work. My approach is along the lines of:
   
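   ```golang
   wr, err := pqarrow.NewFileWriter(
        schema,
        out,
        parquet.NewWriterProperties(
                parquet.WithCompression(compress.Codecs.Snappy),
                parquet.WithDataPageVersion(parquet.DataPageV2),
                parquet.WithVersion(parquet.V2_LATEST),
        ),
        pqarrow.NewArrowWriterProperties(pqarrow.WithStoreSchema()),
   )
   if err != nil {
     return err // assumes an enclosing function that returns error
   }
   defer wr.Close() // Close writes the file footer

   // Untyped constant, so it compares directly against the int64 byte count.
   const maxRowGroupSize = 512 * 1024 * 1024

   for record := range records {
     // Start a new row group once the current one reaches the size limit.
     if wr.RowGroupTotalBytesWritten() >= maxRowGroupSize {
        wr.NewBufferedRowGroup()
     }

     if err := wr.WriteBuffered(record); err != nil {
        return err
     }
   }
   ```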
   
   The output file is around 351MB; however, when inspecting the column 
metadata I see that `row_group_bytes` is around 1GB.
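
For reference, the splitting logic described above can be exercised without the library. The sketch below is only an illustration: `rowGroupWriter`, `writeSplitBySize`, and `fakeWriter` are hypothetical names, the interface merely mirrors the three `pqarrow.FileWriter` methods used in the loop, and the fake writer counts each record's length as its size. It does not show what `RowGroupTotalBytesWritten` reports for buffered, compressed data in the real writer.

```go
package main

import "fmt"

// rowGroupWriter lists the three methods the loop relies on. It is a
// hypothetical interface for this sketch; *pqarrow.FileWriter is assumed
// (not verified here) to satisfy rowGroupWriter[arrow.Record].
type rowGroupWriter[R any] interface {
	RowGroupTotalBytesWritten() int64
	NewBufferedRowGroup()
	WriteBuffered(rec R) error
}

// writeSplitBySize writes every record and starts a new buffered row group
// whenever the current one has already reached maxBytes.
func writeSplitBySize[R any](w rowGroupWriter[R], records []R, maxBytes int64) error {
	for _, rec := range records {
		if w.RowGroupTotalBytesWritten() >= maxBytes {
			w.NewBufferedRowGroup()
		}
		if err := w.WriteBuffered(rec); err != nil {
			return err
		}
	}
	return nil
}

// fakeWriter is a stand-in writer that treats len(rec) as the record size.
type fakeWriter struct {
	groups []int64 // bytes per row group; the last entry is the open group
}

func (f *fakeWriter) RowGroupTotalBytesWritten() int64 {
	if len(f.groups) == 0 {
		return 0
	}
	return f.groups[len(f.groups)-1]
}

func (f *fakeWriter) NewBufferedRowGroup() { f.groups = append(f.groups, 0) }

func (f *fakeWriter) WriteBuffered(rec []byte) error {
	if len(f.groups) == 0 {
		f.NewBufferedRowGroup() // open the first group lazily
	}
	f.groups[len(f.groups)-1] += int64(len(rec))
	return nil
}

func main() {
	w := &fakeWriter{}
	records := [][]byte{
		make([]byte, 40), make([]byte, 70), make([]byte, 30),
		make([]byte, 90), make([]byte, 20),
	}
	if err := writeSplitBySize[[]byte](w, records, 100); err != nil {
		panic(err)
	}
	fmt.Println(w.groups) // prints [110 120 20]
}
```

Because the size check runs before each write, a row group can exceed the limit by up to one record, so the limit acts as a threshold rather than a hard cap.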
   
   ### Component(s)
   
   Parquet

