Re: [PR] More accurate estimate on parquet row groups size [iceberg]

2024-11-22 Thread via GitHub
github-actions[bot] commented on PR #11258: URL: https://github.com/apache/iceberg/pull/11258#issuecomment-2495133577 This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pul

Re: [PR] More accurate estimate on parquet row groups size [iceberg]

2024-10-23 Thread via GitHub
jinyangli34 commented on code in PR #11258: URL: https://github.com/apache/iceberg/pull/11258#discussion_r1813328284 ## parquet/src/main/java/org/apache/iceberg/parquet/ParquetWriter.java: ## @@ -66,6 +66,9 @@ class ParquetWriter implements FileAppender, Closeable { private b

Re: [PR] More accurate estimate on parquet row groups size [iceberg]

2024-10-23 Thread via GitHub
nastra commented on code in PR #11258: URL: https://github.com/apache/iceberg/pull/11258#discussion_r1812448383 ## parquet/src/main/java/org/apache/iceberg/parquet/ParquetWriter.java: ## @@ -66,6 +66,9 @@ class ParquetWriter implements FileAppender, Closeable { private boolea

Re: [PR] More accurate estimate on parquet row groups size [iceberg]

2024-10-23 Thread via GitHub
nastra commented on code in PR #11258: URL: https://github.com/apache/iceberg/pull/11258#discussion_r1812072515 ## parquet/src/test/java/org/apache/iceberg/parquet/TestParquet.java: ## @@ -219,6 +222,52 @@ public void testTwoLevelList() throws IOException { assertThat(recor

Re: [PR] More accurate estimate on parquet row groups size [iceberg]

2024-10-22 Thread via GitHub
jinyangli34 commented on code in PR #11258: URL: https://github.com/apache/iceberg/pull/11258#discussion_r1811499959 ## spark/v3.5/spark/src/test/java/org/apache/iceberg/spark/actions/TestRewriteDataFilesAction.java: ## @@ -557,8 +557,8 @@ public void testBinPackCombineMixedFile

Re: [PR] More accurate estimate on parquet row groups size [iceberg]

2024-10-22 Thread via GitHub
jinyangli34 commented on code in PR #11258: URL: https://github.com/apache/iceberg/pull/11258#discussion_r1811494439 ## parquet/src/test/java/org/apache/iceberg/parquet/TestParquet.java: ## @@ -219,6 +222,52 @@ public void testTwoLevelList() throws IOException { assertThat(

Re: [PR] More accurate estimate on parquet row groups size [iceberg]

2024-10-22 Thread via GitHub
jinyangli34 commented on code in PR #11258: URL: https://github.com/apache/iceberg/pull/11258#discussion_r1811493992 ## parquet/src/test/java/org/apache/iceberg/parquet/TestParquet.java: ## @@ -219,6 +222,52 @@ public void testTwoLevelList() throws IOException { assertThat(

Re: [PR] More accurate estimate on parquet row groups size [iceberg]

2024-10-22 Thread via GitHub
jinyangli34 commented on PR #11258: URL: https://github.com/apache/iceberg/pull/11258#issuecomment-2430386830 > > This makes it difficult to estimate the current row group size, and result in creating much smaller row-group than `write.parquet.row-group-size-bytes` config > > @jinyan

Re: [PR] More accurate estimate on parquet row groups size [iceberg]

2024-10-22 Thread via GitHub
nastra commented on PR #11258: URL: https://github.com/apache/iceberg/pull/11258#issuecomment-2428459165 > This makes it difficult to estimate the current row group size, and result in creating much smaller row-group than `write.parquet.row-group-size-bytes` config @jinyangli34 is th
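
The quoted concern can be illustrated with a small, self-contained sketch. It is not Iceberg's `ParquetWriter` code; the class `RowGroupSizeSketch`, the method `planRowGroups`, and all of the numbers are hypothetical. It assumes, purely for illustration, that a writer closes a row group once its running size estimate reaches the configured target, and shows that an estimate larger than the bytes actually written produces more, smaller row groups.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration only; this is not Iceberg's ParquetWriter. */
public class RowGroupSizeSketch {

  /**
   * Simulates a writer that closes a row group once its running size
   * estimate reaches the target. Returns the record count of each row group.
   *
   * @param numRecords              total records to write
   * @param estimatedBytesPerRecord what the writer believes each record costs (assumed constant)
   * @param targetRowGroupBytes     stand-in for the configured row group size
   */
  static List<Long> planRowGroups(
      long numRecords, long estimatedBytesPerRecord, long targetRowGroupBytes) {
    List<Long> groups = new ArrayList<>();
    long recordsInGroup = 0;
    long estimatedBytes = 0;
    for (long i = 0; i < numRecords; i++) {
      recordsInGroup++;
      estimatedBytes += estimatedBytesPerRecord;
      if (estimatedBytes >= targetRowGroupBytes) {
        // The estimate says the row group is full, so close it.
        groups.add(recordsInGroup);
        recordsInGroup = 0;
        estimatedBytes = 0;
      }
    }
    if (recordsInGroup > 0) {
      groups.add(recordsInGroup); // final, partially filled row group
    }
    return groups;
  }

  public static void main(String[] args) {
    long actualBytesPerRecord = 100; // assumed on-disk cost per record
    long target = 10_000;            // assumed target row group size in bytes

    // Accurate estimate: each row group really holds about `target` bytes.
    List<Long> accurate = planRowGroups(1_000, actualBytesPerRecord, target);
    // Estimate 4x too high: row groups close after a quarter of the records.
    List<Long> inflated = planRowGroups(1_000, 4 * actualBytesPerRecord, target);

    System.out.println(accurate.size() + " row groups of " + accurate.get(0) + " records");
    System.out.println(inflated.size() + " row groups of " + inflated.get(0) + " records");
    // prints "10 row groups of 100 records" then "40 row groups of 25 records"
  }
}
```

With the inflated estimate, each row group holds 25 records, or roughly 2,500 actual bytes against a 10,000-byte target, which is the kind of undersized row group the comment describes.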

Re: [PR] More accurate estimate on parquet row groups size [iceberg]

2024-10-22 Thread via GitHub
nastra commented on code in PR #11258: URL: https://github.com/apache/iceberg/pull/11258#discussion_r1810083290 ## spark/v3.5/spark/src/test/java/org/apache/iceberg/spark/actions/TestRewriteDataFilesAction.java: ## @@ -557,8 +557,8 @@ public void testBinPackCombineMixedFiles() {

Re: [PR] More accurate estimate on parquet row groups size [iceberg]

2024-10-22 Thread via GitHub
nastra commented on code in PR #11258: URL: https://github.com/apache/iceberg/pull/11258#discussion_r1810082246 ## parquet/src/test/java/org/apache/iceberg/parquet/TestParquet.java: ## @@ -219,6 +222,52 @@ public void testTwoLevelList() throws IOException { assertThat(recor

Re: [PR] More accurate estimate on parquet row groups size [iceberg]

2024-10-22 Thread via GitHub
nastra commented on code in PR #11258: URL: https://github.com/apache/iceberg/pull/11258#discussion_r1810073114 ## parquet/src/test/java/org/apache/iceberg/parquet/TestParquet.java: ## @@ -219,6 +222,52 @@ public void testTwoLevelList() throws IOException { assertThat(recor

Re: [PR] More accurate estimate on parquet row groups size [iceberg]

2024-10-21 Thread via GitHub
RussellSpitzer commented on code in PR #11258: URL: https://github.com/apache/iceberg/pull/11258#discussion_r1809421376 ## parquet/src/main/java/org/apache/iceberg/parquet/ParquetWriter.java: ## @@ -66,6 +66,9 @@ class ParquetWriter implements FileAppender, Closeable { privat

Re: [PR] More accurate estimate on parquet row groups size [iceberg]

2024-10-21 Thread via GitHub
RussellSpitzer commented on PR #11258: URL: https://github.com/apache/iceberg/pull/11258#issuecomment-2427568709 I'm very suspicious of the Spark writer getting faster, do we have any explanation of why that is? Shouldn't it be using the same underlying code so any differences should be see

Re: [PR] More accurate estimate on parquet row groups size [iceberg]

2024-10-21 Thread via GitHub
RussellSpitzer commented on PR #11258: URL: https://github.com/apache/iceberg/pull/11258#issuecomment-2427571204 @nastra Can you please take a look at this as well? I want someone else who is familiar with the write path to double check -- This is an automated message from the Apache Git

Re: [PR] More accurate estimate on parquet row groups size [iceberg]

2024-10-21 Thread via GitHub
RussellSpitzer commented on PR #11258: URL: https://github.com/apache/iceberg/pull/11258#issuecomment-2427565585 To blackhole values, you can use JMH's Blackhole: https://javadoc.io/doc/org.openjdk.jmh/jmh-core/1.23/org/openjdk/jmh/infra/Blackhole.html -- This is an automated message from the Apache Git S

Re: [PR] More accurate estimate on parquet row groups size [iceberg]

2024-10-09 Thread via GitHub
jinyangli34 commented on PR #11258: URL: https://github.com/apache/iceberg/pull/11258#issuecomment-2403412692 Ran the benchmark again with `NUM_RECORDS` increased from 1M to 5M. Tested 4 groups: **main**: main branch without the changes in this PR; **PR**: this PR; **PR+2**: two more getBuffe

Re: [PR] More accurate estimate on parquet row groups size [iceberg]

2024-10-08 Thread via GitHub
jinyangli34 commented on code in PR #11258: URL: https://github.com/apache/iceberg/pull/11258#discussion_r1792875688 ## parquet/src/main/java/org/apache/iceberg/parquet/ParquetWriter.java: ## @@ -66,6 +66,9 @@ class ParquetWriter implements FileAppender, Closeable { private b

Re: [PR] More accurate estimate on parquet row groups size [iceberg]

2024-10-08 Thread via GitHub
jinyangli34 commented on code in PR #11258: URL: https://github.com/apache/iceberg/pull/11258#discussion_r1792853328 ## parquet/src/main/java/org/apache/iceberg/parquet/ParquetWriter.java: ## @@ -211,6 +228,8 @@ private void flushRowGroup(boolean finished) { writer.star

Re: [PR] More accurate estimate on parquet row groups size [iceberg]

2024-10-08 Thread via GitHub
RussellSpitzer commented on code in PR #11258: URL: https://github.com/apache/iceberg/pull/11258#discussion_r1792496212 ## parquet/src/main/java/org/apache/iceberg/parquet/ParquetWriter.java: ## @@ -132,7 +135,9 @@ private void ensureWriterInitialized() { @Override public

Re: [PR] More accurate estimate on parquet row groups size [iceberg]

2024-10-08 Thread via GitHub
RussellSpitzer commented on code in PR #11258: URL: https://github.com/apache/iceberg/pull/11258#discussion_r1792493504 ## parquet/src/main/java/org/apache/iceberg/parquet/ParquetWriter.java: ## @@ -66,6 +66,9 @@ class ParquetWriter implements FileAppender, Closeable { privat

Re: [PR] More accurate estimate on parquet row groups size [iceberg]

2024-10-08 Thread via GitHub
RussellSpitzer commented on code in PR #11258: URL: https://github.com/apache/iceberg/pull/11258#discussion_r1792492250 ## parquet/src/main/java/org/apache/iceberg/parquet/ParquetWriter.java: ## @@ -211,6 +228,8 @@ private void flushRowGroup(boolean finished) { writer.s

Re: [PR] More accurate estimate on parquet row groups size [iceberg]

2024-10-08 Thread via GitHub
RussellSpitzer commented on code in PR #11258: URL: https://github.com/apache/iceberg/pull/11258#discussion_r1792476033 ## parquet/src/main/java/org/apache/iceberg/parquet/ParquetWriter.java: ## @@ -185,9 +190,21 @@ public List splitOffsets() { return null; } + /* +

Re: [PR] More accurate estimate on parquet row groups size [iceberg]

2024-10-08 Thread via GitHub
edgarRd commented on code in PR #11258: URL: https://github.com/apache/iceberg/pull/11258#discussion_r1792071351 ## parquet/src/main/java/org/apache/iceberg/parquet/ParquetWriter.java: ## @@ -185,9 +190,17 @@ public List splitOffsets() { return null; } + private long