singhpk234 commented on code in PR #7279:
URL: https://github.com/apache/iceberg/pull/7279#discussion_r1160094891
##########
parquet/src/main/java/org/apache/iceberg/parquet/VectorizedParquetReader.java:
##########
@@ -154,21 +177,47 @@ public T next() {
}
private void advance() {
- while (shouldSkip[nextRowGroup]) {
- nextRowGroup += 1;
- reader.skipNextRowGroup();
- }
- PageReadStore pages;
try {
- pages = reader.readNextRowGroup();
- } catch (IOException e) {
- throw new RuntimeIOException(e);
+ Preconditions.checkNotNull(prefetchRowGroupFuture, "future should not be null");
+ PageReadStore pages = prefetchRowGroupFuture.get();
+
+ if (prefetchedRowGroup >= totalRowGroups) {
+ return;
+ }
+ Preconditions.checkState(
+ pages != null,
+ "advance() should have been only when there was at least one row group to read");
+ long rowPosition = rowGroupsStartRowPos[prefetchedRowGroup];
+ model.setRowGroupInfo(pages, columnChunkMetadata.get(prefetchedRowGroup), rowPosition);
+ nextRowGroupStart += pages.getRowCount();
+ prefetchedRowGroup += 1;
+ prefetchNextRowGroup(); // eagerly fetch the next row group
Review Comment:
Agreed, we can certainly do that. For now I kept it simple and close to the
existing PoC code. I can add this in a later revision if folks are on board.
That said, we probably don't want to load all the row groups in one shot, since
that could cause memory pressure. A config to control how many row groups are
prefetched would be helpful.
P.S. I will also add prefetching of Parquet data pages shortly.
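
A minimal sketch of the bounded prefetch idea mentioned above, assuming a configurable cap on how many row groups may be queued ahead of the consumer. This is not the PR's code: `BoundedPrefetcher`, `maxInFlight`, and the `loader` callback are hypothetical names, and the loader only stands in for whatever actually reads one row group.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.NoSuchElementException;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.IntFunction;

/** Hypothetical helper: prefetches items in order, with a cap on queued prefetches. */
class BoundedPrefetcher<T> implements AutoCloseable {
  private final IntFunction<T> loader; // assumption: loads one row group by index
  private final int total;             // number of row groups to read
  private final int maxInFlight;       // assumption: would come from a config value
  private final ExecutorService pool = Executors.newSingleThreadExecutor();
  private final Deque<Future<T>> inFlight = new ArrayDeque<>();
  private int nextToSubmit = 0;

  BoundedPrefetcher(IntFunction<T> loader, int total, int maxInFlight) {
    if (maxInFlight < 1) {
      throw new IllegalArgumentException("maxInFlight must be at least 1");
    }
    this.loader = loader;
    this.total = total;
    this.maxInFlight = maxInFlight;
    fill();
  }

  /** Submits loads until the queue holds maxInFlight futures or nothing is left. */
  private void fill() {
    while (inFlight.size() < maxInFlight && nextToSubmit < total) {
      final int index = nextToSubmit++;
      Callable<T> task = () -> loader.apply(index);
      inFlight.add(pool.submit(task));
    }
  }

  boolean hasNext() {
    return !inFlight.isEmpty();
  }

  /** Blocks until the oldest queued load finishes, then tops the queue back up. */
  T next() {
    Future<T> head = inFlight.poll();
    if (head == null) {
      throw new NoSuchElementException("no more row groups");
    }
    try {
      T value = head.get();
      fill(); // the consumer now holds one item; up to maxInFlight more are queued
      return value;
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new RuntimeException(e);
    } catch (ExecutionException e) {
      throw new RuntimeException(e.getCause());
    }
  }

  @Override
  public void close() {
    pool.shutdownNow();
  }
}
```

With `maxInFlight` set to 1 this behaves like the single eager prefetch in the hunk above. Larger values trade memory for more read-ahead, and the consumer still sees row groups in their original order.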
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]