shachibista opened a new issue, #374: URL: https://github.com/apache/arrow-go/issues/374
### Describe the usage question you have. Please include as many useful details as possible.

I am trying to size row groups by bytes instead of by number of rows, but I cannot figure out how. The [Parquet specification recommends](https://parquet.apache.org/docs/file-format/configurations/) a row group size between 512 MB and 1 GB. Setting this through the number of rows is unintuitive, since records can be of varying sizes and would require some heuristic for estimation. Is there a way to specify row groups by size?

I see that there is a [comment on `WriteBuffered()`](https://github.com/apache/arrow-go/blob/ec15aba303a09c81530d1741f69ba4e45d1e9cf7/parquet/pqarrow/file_writer.go#L157):

```
// Additionally, it allows to manually break your row group by
// checking RowGroupTotalBytesWritten and calling NewBufferedRowGroup,
// while Write will always create at least 1 row group for the record.
```

However, I cannot get it to work. My approach is along these lines (error handling elided):

```golang
wr, _ := pqarrow.NewFileWriter(
	schema,
	out,
	parquet.NewWriterProperties(
		parquet.WithCompression(compress.Codecs.Snappy),
		parquet.WithDataPageVersion(parquet.DataPageV2),
		parquet.WithVersion(parquet.V2_LATEST),
	),
	pqarrow.NewArrowWriterProperties(pqarrow.WithStoreSchema()),
)

// RowGroupTotalBytesWritten returns int64, so the threshold must be int64 too.
const maxRowGroupSize = int64(512 * 1024 * 1024)
for record := range records {
	if wr.RowGroupTotalBytesWritten() >= maxRowGroupSize {
		wr.NewBufferedRowGroup()
	}
	wr.WriteBuffered(record)
}
```

The output file is around 351 MB, yet when inspecting the column metadata I see that `row_group_bytes` is around 1 GB.

### Component(s)

Parquet