[GitHub] [iceberg] singhpk234 commented on a diff in pull request #7279: [Parquet] Eagerly fetch row groups when reading parquet

via GitHub Thu, 06 Apr 2023 10:51:19 -0700


singhpk234 commented on code in PR #7279:
URL: https://github.com/apache/iceberg/pull/7279#discussion_r1160094891



##########
parquet/src/main/java/org/apache/iceberg/parquet/VectorizedParquetReader.java:
##########
@@ -154,21 +177,47 @@ public T next() {
     }
 
     private void advance() {
-      while (shouldSkip[nextRowGroup]) {
-        nextRowGroup += 1;
-        reader.skipNextRowGroup();
-      }
-      PageReadStore pages;
       try {
-        pages = reader.readNextRowGroup();
-      } catch (IOException e) {
-        throw new RuntimeIOException(e);
+        Preconditions.checkNotNull(prefetchRowGroupFuture, "future should not be null");
+        PageReadStore pages = prefetchRowGroupFuture.get();
+
+        if (prefetchedRowGroup >= totalRowGroups) {
+          return;
+        }
+        Preconditions.checkState(
+            pages != null,
+            "advance() should have been only when there was at least one row group to read");
+        long rowPosition = rowGroupsStartRowPos[prefetchedRowGroup];
+        model.setRowGroupInfo(pages, columnChunkMetadata.get(prefetchedRowGroup), rowPosition);
+        nextRowGroupStart += pages.getRowCount();
+        prefetchedRowGroup += 1;
+        prefetchNextRowGroup(); // eagerly fetch the next row group

Review Comment:
   Agreed, we can certainly do that. For now I kept it simple and close to the
existing PoC code. I will add this in a later revision if folks are on board.
   
   That said, we might not want to load all the row groups in one shot, since
that could cause memory pressure. A config that controls how many row groups
are prefetched would probably be helpful.
   
   P.S. I will also add prefetching of Parquet data pages shortly.
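
A minimal sketch of how a configurable prefetch limit could look, in plain Java. `BoundedPrefetcher`, `loader`, and `maxInFlight` are hypothetical names used only for illustration; none of them come from the PR, Iceberg, or Parquet. The loader stands in for whatever reads one row group, and `maxInFlight` stands in for the proposed config value.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.NoSuchElementException;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;
import java.util.function.IntFunction;

/** Hypothetical helper (not an Iceberg API): prefetches items in order, bounded by maxInFlight. */
final class BoundedPrefetcher<T> {
  private final ExecutorService pool;
  private final IntFunction<T> loader; // assumed to load item i, e.g. one row group
  private final int total;
  private final int maxInFlight; // assumed config: how many loads may be queued ahead
  private final Deque<Future<T>> pending = new ArrayDeque<>();
  private int nextToSubmit = 0;

  BoundedPrefetcher(ExecutorService pool, IntFunction<T> loader, int total, int maxInFlight) {
    if (maxInFlight < 1) {
      throw new IllegalArgumentException("maxInFlight must be >= 1");
    }
    this.pool = pool;
    this.loader = loader;
    this.total = total;
    this.maxInFlight = maxInFlight;
    fill(); // start the first batch of loads eagerly
  }

  boolean hasNext() {
    return !pending.isEmpty();
  }

  /** Blocks until the oldest pending load finishes, then tops the queue back up. */
  T next() {
    Future<T> head = pending.poll();
    if (head == null) {
      throw new NoSuchElementException("no more items to read");
    }
    try {
      T value = head.get();
      fill();
      return value;
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException("interrupted while waiting for prefetch", e);
    } catch (ExecutionException e) {
      throw new IllegalStateException("prefetch failed", e.getCause());
    }
  }

  // Submit loads until maxInFlight are queued or every item has been submitted.
  private void fill() {
    while (pending.size() < maxInFlight && nextToSubmit < total) {
      final int index = nextToSubmit++;
      Callable<T> task = () -> loader.apply(index);
      pending.add(pool.submit(task));
    }
  }
}
```

With `maxInFlight` set to 1 this keeps a single load queued ahead of the consumer; larger values trade memory for more overlap between I/O and decoding. Results are always returned in submission order, because the queue is drained from its head.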



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

