[GitHub] [iceberg] ChristinaTech opened a new issue, #7623: Spark shows incorrect count for incremental read in some cases.

via GitHub Tue, 16 May 2023 07:43:50 -0700


ChristinaTech opened a new issue, #7623:
URL: https://github.com/apache/iceberg/issues/7623


   ### Apache Iceberg version
   
   1.2.1 (latest release)
   
   ### Query engine
   
   Spark
   
   ### Please describe the bug 🐞
   
   When performing an incremental read in Iceberg 1.2.1, calling `count` on the 
DataFrame immediately after `load`/`table` returns the count as though it were 
not an incremental read DataFrame. The issue disappears if the DataFrame you 
call `count` on has any post-load operation, such as `orderBy`, applied to it. 
In addition, if you collect the contents of the unmodified DataFrame to a list 
or call `show` on it, the number of rows is correct, which makes this a 
well-hidden bug.
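
   A minimal PySpark-style sketch of the comparison described above. It is an 
illustration under stated assumptions, not the test from the draft PR: the 
helper names `incremental_read` and `count_discrepancies`, the table name, and 
the sort column are hypothetical, and the incremental read is assumed to be 
configured through Iceberg's `start-snapshot-id` and `end-snapshot-id` read 
options.

```python
def incremental_read(spark, table, start_snapshot_id, end_snapshot_id):
    """Build an incremental-read DataFrame with no post-load operations.

    `spark` is expected to be a SparkSession with Iceberg configured;
    `table` is a table identifier or path (hypothetical in this sketch).
    """
    return (
        spark.read.format("iceberg")
        .option("start-snapshot-id", str(start_snapshot_id))
        .option("end-snapshot-id", str(end_snapshot_id))
        .load(table)
    )


def count_discrepancies(df, sort_column):
    """Return the row count obtained in the three ways the report compares."""
    return {
        # count() directly on the loaded DataFrame: reported as incorrect
        "count": df.count(),
        # materializing the rows and counting them: reported as correct
        "collect": len(df.collect()),
        # count() after a post-load operation: reported as correct
        "orderBy_count": df.orderBy(sort_column).count(),
    }
```

   With a real session, something like 
`count_discrepancies(incremental_read(spark, "db.tbl", start_id, end_id), "id")` 
should return three equal values; per the report, on the affected versions the 
`"count"` entry differs from the other two.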
   
   We found this bug when a unit test for one of our use cases failed while we 
were upgrading from Iceberg 1.1.0 to Iceberg 1.2.1, meaning this is a 
regression between those versions. We were able to replicate it in Iceberg's 
own unit tests, where it impacts Spark 3.3/3.4 but not Spark 3.1/3.2. Because 
it only appears after upgrading to a newer Iceberg version, the issue seems 
more likely to be in Iceberg than in Spark. Spark 3.1/3.2 are probably 
unaffected because the change that introduced the bug was an improvement that 
was not backported to them, but I have not yet tracked down the specific 
change.
   
   I have opened draft PR #7616, which contains the minimal change to the unit 
tests that replicates the issue, and I will attempt to narrow down the commit 
that causes it in the coming days as time permits.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

