zhongyujiang commented on PR #10567:
URL: https://github.com/apache/iceberg/pull/10567#issuecomment-2243038965

   Hi @pvary Thanks for reviewing.
   
   I think the issue here is somewhat different from what you understand.
   
   > We have at least 3 FileScanTasks (FS1, FS2, FS3) to read
   > We have a filter which filters out every record from FS1
   > We have a failure after the reader already skipped reading FS1 (file 
offset is not increased), and started to read FS2 (file offset is increased)
   
   Yes, all the records in FS1 are filtered out, and then we have a checkpoint when the 
reader is reading **FS3** with `fileOffset`=1 (`fileOffset` starts from 0 in 
the code, so the value for FS3 should be 2 and this is incorrect), and then we have a failure.
   
   When the first call to `seek(int startingFileOffset=0, long 
startingRecordOffset=0)` is made, FS1 will be skipped.
   
   Please note that when executing `updateCurrentIterator`, since all records 
in FS1 are filtered out, when `currentIterator` points to FS1, `currentIterator.hasNext()` 
will return `false`, causing the loop to continue and update `currentIterator` to FS2.
   
   But at the end of `seek`, `fileOffset` is assigned the value 0 (the `startingFileOffset`), which is 
incorrect because `currentIterator` already points to FS2.
   
   ```
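     private void updateCurrentIterator() {
       try {
         // Advance past every task whose iterator has no records left,
         // including tasks whose records were all filtered out.
         while (!currentIterator.hasNext() && tasks.hasNext()) {
           currentIterator.close();
           currentIterator = openTaskIterator(tasks.next());
           // One increment per task opened, skipped tasks included.
           fileOffset += 1;
           recordOffset = 0L;
         }
       } catch (IOException e) {
         throw new UncheckedIOException(e);
       }
     }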
   ```
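
To make the drift concrete, here is a minimal sketch that models the reader with in-memory lists instead of real `FileScanTask`s. The class `SeekOffsetSketch`, its `exampleTasks()` helper, and the body of `seek` are assumptions made for illustration (the `seek` here simply positions the reader and then stores the requested offsets); only the loop in `updateCurrentIterator` follows the snippet quoted above.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;

// Hypothetical, simplified model of the reader; not the real Iceberg class.
final class SeekOffsetSketch {
  // Each inner list holds the records of one task that survive the filter.
  private final Iterator<List<String>> tasks;
  private Iterator<String> currentIterator = Collections.emptyIterator();
  private int fileOffset = -1;
  private long recordOffset = 0L;

  SeekOffsetSketch(List<List<String>> filteredTasks) {
    this.tasks = filteredTasks.iterator();
  }

  // Same loop as the quoted snippet, minus the I/O handling.
  private void updateCurrentIterator() {
    while (!currentIterator.hasNext() && tasks.hasNext()) {
      currentIterator = tasks.next().iterator();
      fileOffset += 1;
      recordOffset = 0L;
    }
  }

  // Assumed shape of seek: position the reader, then store the requested offsets.
  void seek(int startingFileOffset, long startingRecordOffset) {
    for (int i = 0; i < startingFileOffset; i++) {
      tasks.next();
    }
    updateCurrentIterator();
    for (long i = 0; i < startingRecordOffset; i++) {
      currentIterator.next();
    }
    fileOffset = startingFileOffset; // overwrites whatever the loop counted
    recordOffset = startingRecordOffset;
  }

  String next() {
    updateCurrentIterator();
    recordOffset += 1;
    return currentIterator.next();
  }

  int fileOffset() {
    return fileOffset;
  }

  long recordOffset() {
    return recordOffset;
  }

  static List<List<String>> exampleTasks() {
    return Arrays.asList(
        Collections.<String>emptyList(),    // FS1: every record filtered out
        Arrays.asList("fs2-r0", "fs2-r1"),  // FS2
        Arrays.asList("fs3-r0", "fs3-r1")); // FS3
  }

  public static void main(String[] args) {
    SeekOffsetSketch reader = new SeekOffsetSketch(exampleTasks());
    reader.seek(0, 0);
    // The reader is positioned on FS2 (index 1), but reports fileOffset=0.
    System.out.println("after seek: fileOffset=" + reader.fileOffset());
    reader.next();
    reader.next();
    String record = reader.next(); // first record of FS3 (index 2)
    // Prints: fs3-r0 fileOffset=1 recordOffset=1
    System.out.println(
        record + " fileOffset=" + reader.fileOffset() + " recordOffset=" + reader.recordOffset());
  }
}
```

In this model the reported `fileOffset` stays one behind the file actually being read for the rest of the split, which matches the `fileOffset`=1 seen while reading FS3.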
   
   
   
   
   In your case, if the checkpoint occurs while reading FS2, the `fileOffset` 
in the checkpoint will be 0, which is still incorrect. However, when 
recovering, `updateCurrentIterator()` will skip FS1 again, so by a stroke of luck the 
file read after recovery will be correct.
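
The difference between the two checkpoints can be sketched with a small in-memory model of recovery. Everything here is hypothetical (`RestoreSketch`, `readAfterRestore`, the record names, and the assumed behaviour of `seek`: skip `startingFileOffset` tasks, let the update loop pass over empty ones, then skip `startingRecordOffset` records); it is not the real Iceberg code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;

// Hypothetical model of restoring a reader from a checkpointed position.
final class RestoreSketch {
  private final Iterator<List<String>> tasks;
  private Iterator<String> currentIterator = Collections.emptyIterator();

  RestoreSketch(List<List<String>> filteredTasks) {
    this.tasks = filteredTasks.iterator();
  }

  // Moves to the next task that still has records, passing over empty ones.
  private void updateCurrentIterator() {
    while (!currentIterator.hasNext() && tasks.hasNext()) {
      currentIterator = tasks.next().iterator();
    }
  }

  // Assumed seek behaviour: skip whole tasks first, then records.
  void seek(int startingFileOffset, long startingRecordOffset) {
    for (int i = 0; i < startingFileOffset; i++) {
      tasks.next();
    }
    updateCurrentIterator();
    for (long i = 0; i < startingRecordOffset; i++) {
      currentIterator.next();
    }
  }

  // Drains every record that would be emitted after the restore.
  List<String> remaining() {
    List<String> out = new ArrayList<>();
    updateCurrentIterator();
    while (currentIterator.hasNext()) {
      out.add(currentIterator.next());
      updateCurrentIterator();
    }
    return out;
  }

  static List<String> readAfterRestore(int fileOffset, long recordOffset) {
    RestoreSketch reader =
        new RestoreSketch(
            Arrays.asList(
                Collections.<String>emptyList(),      // FS1: every record filtered out
                Arrays.asList("fs2-r0", "fs2-r1"),    // FS2
                Arrays.asList("fs3-r0", "fs3-r1")));  // FS3
    reader.seek(fileOffset, recordOffset);
    return reader.remaining();
  }

  public static void main(String[] args) {
    // Checkpoint taken in FS2 after one record, stored as (0, 1):
    // FS1 is skipped again, so reading resumes at fs2-r1 as intended.
    System.out.println(readAfterRestore(0, 1)); // [fs2-r1, fs3-r0, fs3-r1]
    // Checkpoint taken in FS3 after one record, stored as (1, 1):
    // only FS1 is skipped, so reading resumes inside FS2 instead of FS3.
    System.out.println(readAfterRestore(1, 1)); // [fs2-r1, fs3-r0, fs3-r1]
  }
}
```

In the second call the intended remainder is only `fs3-r1`, so under these assumptions `fs2-r1` and `fs3-r0` would be read a second time after recovery.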
   
   
   
   
   



