jkolash commented on issue #13297:
URL: https://github.com/apache/iceberg/issues/13297#issuecomment-2963910701

   I was able to make some hacky changes that reduced memory usage in this draft PR: https://github.com/apache/iceberg/pull/13298. It is mainly meant to show which objects were no longer needed but were the critical ones that had to be GC'd.
   
   I also tested
   
   ```java
   // Collect the data file paths from the table's Iceberg "files" metadata table
   String[] paths = sparkSession.sql("select file_path from " + table + ".files")
           .collectAsList().stream()
           .map(row -> row.getString(0))
           .toArray(String[]::new);
   System.out.println(Arrays.asList(paths));

   // Rough partition label; a shared counter like this is only meaningful in local mode
   AtomicInteger partitionCounter = new AtomicInteger(0);

   // Read the Parquet files directly, bypassing the Iceberg read path
   sparkSession.read().load(paths).coalesce(4).foreachPartition(iterator -> {
       int partition = partitionCounter.getAndIncrement();
       long rowCount = 0;

       while (iterator.hasNext()) {
           iterator.next();
           if (rowCount++ % 100000 == 0) {
               System.out.println(partition + " " + rowCount);
           }
       }
   });
   ```
   
   This shows that the new read path through Iceberg is the issue: when loading/processing the Parquet files directly, the problem doesn't manifest.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

