jkolash commented on issue #13297: URL: https://github.com/apache/iceberg/issues/13297#issuecomment-2963910701
I was able to make some hacky changes that reduced memory usage in this draft PR: https://github.com/apache/iceberg/pull/13298. It is mainly meant to show which critical objects needed to be GC'd once they were no longer needed.

I also tested reading the underlying Parquet files directly, bypassing the Iceberg read path (`partitionCounter` here is an `AtomicInteger` declared outside the lambda):

```java
import java.util.Arrays;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicLong;

AtomicInteger partitionCounter = new AtomicInteger(0);

// Collect the data file paths from the Iceberg "files" metadata table.
String[] paths = sparkSession.sql("select file_path from " + table + ".files")
    .collectAsList().stream()
    .map(row -> row.getString(0))
    .toArray(String[]::new);
System.out.println(Arrays.asList(paths));

// Read the Parquet files directly, skipping the Iceberg read path.
sparkSession.read().load(paths).coalesce(4).foreachPartition(iterator -> {
  int partition = partitionCounter.getAndIncrement();
  AtomicLong rowCounter = new AtomicLong(0);
  while (iterator.hasNext()) {
    iterator.next();
    if (rowCounter.getAndIncrement() % 100000 == 0) {
      System.out.println(partition + " " + rowCounter.get());
    }
  }
});
```

This shows that the new read path through Iceberg is the issue: when loading and processing the Parquet files directly, the problem doesn't manifest.
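For anyone wanting to reproduce the comparison, a rough way to quantify the difference is to snapshot used heap before, during, and after the scan. This is a hypothetical helper (not part of the PR), with a byte-array allocation standing in for the scan under test:

```java
// Sketch: measure used heap around a workload to see whether retained
// objects are actually collected afterwards. The ballast allocation is a
// stand-in for the Iceberg/Parquet scan under test.
public class HeapCheck {

  static long usedHeapBytes() {
    Runtime rt = Runtime.getRuntime();
    return rt.totalMemory() - rt.freeMemory();
  }

  public static void main(String[] args) {
    long before = usedHeapBytes();

    // ... run the scan under test here; ballast simulates its allocations ...
    byte[] ballast = new byte[16 * 1024 * 1024];
    long peak = usedHeapBytes();

    ballast = null; // drop the reference so the memory is eligible for GC
    System.gc();    // best-effort hint only; collection is not guaranteed
    long after = usedHeapBytes();

    System.out.println("before=" + before + " peak=" + peak + " after=" + after);
  }
}
```

If the draft PR's changes work, `after` should fall back toward `before` once the scan's references are released, rather than staying near `peak`.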