justinmclean opened a new issue, #10603:
URL: https://github.com/apache/gravitino/issues/10603

   ### What would you like to be improved?
   
   LancePartitionStatisticStorage defines the table_id Arrow column as a 64-bit
   integer, but createFragmentMetadata() retrieves it as UInt8Vector and writes a
   Long table ID into that vector.
   
   Relevant locations: LancePartitionStatisticStorage.java, lines 114 and 388.
   
   This makes the Lance-backed partition statistics update path inconsistent
   with its own schema and can fail at runtime when statistics are written.
   
   ### How should we improve?
   
   Use the Arrow vector type that matches the declared schema for table_id,
   such as BigIntVector, instead of UInt8Vector.
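
As a rough illustration of the mismatch, here is a minimal sketch that uses only the Java standard library. The classes `ColumnVector`, `SignedLongVector`, and `UnsignedLongVector` and the helper `vectorAs` are hypothetical stand-ins invented for this example; they are not Arrow or Gravitino APIs. The sketch assumes the schema declares table_id as a signed 64-bit integer, for which Arrow Java creates a BigIntVector (UInt8Vector is the 8-byte unsigned counterpart). It shows a lookup that checks the runtime class before casting, so a schema/vector mismatch fails with a descriptive message.

```java
import java.util.HashMap;
import java.util.Map;

public class VectorTypeCheck {
  /** Hypothetical stand-in for an Arrow field vector; not the real Arrow API. */
  interface ColumnVector {}

  /** Stand-in for a signed 64-bit vector (the role BigIntVector plays in Arrow). */
  static final class SignedLongVector implements ColumnVector {
    private final long[] values = new long[4];

    void set(int index, long value) {
      values[index] = value;
    }

    long get(int index) {
      return values[index];
    }
  }

  /** Stand-in for an unsigned 64-bit vector (the role UInt8Vector plays in Arrow). */
  static final class UnsignedLongVector implements ColumnVector {}

  /** Returns the named column, failing with a descriptive error on a type mismatch. */
  static <T extends ColumnVector> T vectorAs(
      Map<String, ColumnVector> root, String name, Class<T> expected) {
    ColumnVector vector = root.get(name);
    if (!expected.isInstance(vector)) {
      String actual = vector == null ? "no column" : vector.getClass().getSimpleName();
      throw new IllegalStateException(
          "Column '" + name + "' expected " + expected.getSimpleName() + " but found " + actual);
    }
    return expected.cast(vector);
  }

  public static void main(String[] args) {
    // The "schema" declares table_id as a signed 64-bit column.
    Map<String, ColumnVector> root = new HashMap<>();
    root.put("table_id", new SignedLongVector());

    // Matching type: the lookup succeeds and the long ID round-trips.
    SignedLongVector tableId = vectorAs(root, "table_id", SignedLongVector.class);
    tableId.set(0, 256L);
    System.out.println(tableId.get(0));

    // Mismatched type: the lookup fails immediately with a clear message.
    try {
      vectorAs(root, "table_id", UnsignedLongVector.class);
    } catch (IllegalStateException e) {
      System.out.println(e.getMessage());
    }
  }
}
```

In the real code the equivalent change would presumably be to request the vector class that corresponds to the declared field type, rather than adding a helper like this one.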
   
   Here's a unit test that exercises the update path with a table ID of 256:
   ```
   @Test
   public void testUpdateStatisticsWithLargeTableId() throws Exception {
     PartitionStatisticStorageFactory factory = new LancePartitionStatisticStorageFactory();
     String metalakeName = "metalake";
     MetadataObject metadataObject =
         MetadataObjects.of(
             Lists.newArrayList("catalog", "schema", "table"), MetadataObject.Type.TABLE);

     // Mock the entity store so the table resolves to an ID of 256.
     EntityStore entityStore = mock(EntityStore.class);
     TableEntity tableEntity = mock(TableEntity.class);
     when(entityStore.get(any(), any(), any())).thenReturn(tableEntity);
     when(tableEntity.id()).thenReturn(256L);
     FieldUtils.writeField(GravitinoEnv.getInstance(), "entityStore", entityStore, true);

     // Store the Lance dataset under a fresh temporary directory.
     String location = Files.createTempDirectory("lance_stats_large_table_id").toString();
     Map<String, String> properties = Maps.newHashMap();
     properties.put("location", location);

     LancePartitionStatisticStorage storage =
         (LancePartitionStatisticStorage) factory.create(properties);
     try {
       Map<String, StatisticValue<?>> statistics = Maps.newHashMap();
       statistics.put("statistic0", StatisticValues.stringValue("value0"));

       // Write one statistic for one partition; this is the path that writes table_id.
       storage.updateStatistics(
           metalakeName,
           Lists.newArrayList(
               MetadataObjectStatisticsUpdate.of(
                   metadataObject,
                   Lists.newArrayList(
                       PartitionStatisticsModification.update("partition0", statistics)))));

       // Read the statistic back and check that it round-trips.
       List<PersistedPartitionStatistics> listedStats =
           storage.listStatistics(
               metalakeName,
               metadataObject,
               PartitionRange.between(
                   "partition0",
                   PartitionRange.BoundType.CLOSED,
                   "partition0",
                   PartitionRange.BoundType.CLOSED));

       Assertions.assertEquals(1, listedStats.size());
       Assertions.assertEquals("partition0", listedStats.get(0).partitionName());
       Assertions.assertEquals(1, listedStats.get(0).statistics().size());
       Assertions.assertEquals("statistic0", listedStats.get(0).statistics().get(0).name());
       Assertions.assertEquals("value0", listedStats.get(0).statistics().get(0).value().value());
     } finally {
       FileUtils.deleteDirectory(new File(location + "/" + tableEntity.id() + ".lance"));
       storage.close();
     }
   }
   ```

