ahmedabu98 opened a new issue, #11900:
URL: https://github.com/apache/iceberg/issues/11900

   ### Apache Iceberg version
   
   1.7.1 (latest release)
   
   ### Query engine
   
   None
   
   ### Please describe the bug 🐞
   
   Part of our workflow in Apache Beam's Iceberg connector requires recreating DataFiles, but this breaks when the table is partitioned with the `month` or `hour` transforms. The following code reproduces the issue:
   
   ```java
   org.apache.iceberg.Schema schema =
       new org.apache.iceberg.Schema(
           Types.NestedField.required(1, "month", Types.TimestampType.withoutZone()),
           Types.NestedField.required(2, "hour", Types.TimestampType.withoutZone()));
   PartitionSpec spec = PartitionSpec.builderFor(schema).month("month").hour("hour").build();
   Table table = catalog.createTable(TableIdentifier.parse("db.table"), schema, spec);
   
   LocalDateTime val = LocalDateTime.parse("2024-10-08T13:18:20.053");
   Record rec = GenericRecord.create(schema).copy(
           ImmutableMap.of(
                   "month", val,
                   "hour", val));
   Record partitionableRec = getPartitionableRecord(rec, spec, schema);
   PartitionKey pk = new PartitionKey(spec, schema);
   pk.partition(partitionableRec);
   DataWriter<Record> writer =
       Parquet.writeData(
               table
                   .io()
                   .newOutputFile(table.locationProvider().newDataLocation(spec, pk, "test_file")))
           .createWriterFunc(GenericParquetWriter::buildWriter)
           .schema(table.schema())
           .withSpec(table.spec())
           .withPartition(pk)
           .overwrite()
           .build();
   writer.write(rec);
   writer.close();
   DataFile file = writer.toDataFile();
   
   // recreate data file using the original file
   DataFiles.builder(spec)
       .withPath(file.path().toString())
       .withFormat(file.format())
       .withPartition(file.partition())
       .withFileSizeInBytes(file.fileSizeInBytes())
       .withRecordCount(file.recordCount())
       .withPartitionPath(spec.partitionToPath(file.partition()))
       .build();
   ```
   
   The last step (rebuilding the DataFile) fails with the following error:
   ```
   java.lang.NumberFormatException: For input string: "2024-10"
        at java.base/java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
        at java.base/java.lang.Integer.parseInt(Integer.java:652)
        at java.base/java.lang.Integer.valueOf(Integer.java:983)
        at org.apache.iceberg.types.Conversions.fromPartitionString(Conversions.java:51)
        at org.apache.iceberg.DataFiles.fillFromPath(DataFiles.java:86)
        at org.apache.iceberg.DataFiles$Builder.withPartitionPath(DataFiles.java:266)
   ```
   
   I would expect the result of `spec.partitionToPath(file.partition())` to be usable as-is when recreating the DataFile, but the [parsing logic here](https://github.com/apache/iceberg/blob/e3f50e5c62d01f3f31239d197ef281fc36cf31fa/core/src/main/java/org/apache/iceberg/DataFiles.java#L78-L87) doesn't seem robust enough: it can't handle the human-readable values that the `month` and `hour` transforms produce (e.g. `2024-10`).
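
   For illustration, a minimal sketch of the mismatch as I understand it (variable names are mine; the ordinal and path values are what I'd expect from the default transforms and partition field names):

   ```java
   // The partition tuple stores the month transform's result as an int ordinal,
   // but partitionToPath() renders it in its human-readable form.
   Integer monthOrdinal = file.partition().get(0, Integer.class); // e.g. 657 (months since 1970-01)
   String partitionPath = spec.partitionToPath(file.partition()); // e.g. "month_month=2024-10/..."

   // withPartitionPath() then parses "2024-10" back via
   // Conversions.fromPartitionString(Types.IntegerType.get(), "2024-10"),
   // which calls Integer.valueOf("2024-10") and throws NumberFormatException.
   ```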
   
   We've been able to use this [workaround](https://github.com/apache/beam/blob/18ec3317e500a6fee72fc8c24552c21808437bef/sdks/java/io/iceberg/src/main/java/org/apache/beam/sdk/io/iceberg/RecordWriterManager.java#L237-L259), replicated below:
   <details>
   <summary><b>Workaround</b></summary>
   
   ```java
   static String getPartitionDataPath(
       String partitionPath, Map<String, PartitionField> partitionFieldMap) {
     if (partitionPath.isEmpty() || partitionFieldMap.isEmpty()) {
       return partitionPath;
     }
     List<String> resolved = new ArrayList<>();
     for (String partition : Splitter.on('/').splitToList(partitionPath)) {
       List<String> nameAndValue = Splitter.on('=').splitToList(partition);
       String name = nameAndValue.get(0);
       String value = nameAndValue.get(1);
       String transformName =
           Preconditions.checkArgumentNotNull(partitionFieldMap.get(name)).transform().toString();
       if (Transforms.month().toString().equals(transformName)) {
         int month = YearMonth.parse(value).getMonthValue();
         value = String.valueOf(month);
       } else if (Transforms.hour().toString().equals(transformName)) {
         long hour = ChronoUnit.HOURS.between(EPOCH, LocalDateTime.parse(value, HOUR_FORMATTER));
         value = String.valueOf(hour);
       }
       resolved.add(name + "=" + value);
     }
     return String.join("/", resolved);
   }
   ```
   </details>
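
   For context, this is roughly how we apply it when recreating the DataFile (the field-map construction below is a simplified sketch of what our connector does, not a verbatim copy):

   ```java
   // Map each partition field name to its PartitionField so the workaround can
   // look up the transform that produced each path segment.
   Map<String, PartitionField> partitionFieldMap = new HashMap<>();
   for (PartitionField partitionField : spec.fields()) {
     partitionFieldMap.put(partitionField.name(), partitionField);
   }

   DataFiles.builder(spec)
       .withPath(file.path().toString())
       .withFormat(file.format())
       .withPartition(file.partition())
       .withFileSizeInBytes(file.fileSizeInBytes())
       .withRecordCount(file.recordCount())
       .withPartitionPath(
           getPartitionDataPath(spec.partitionToPath(file.partition()), partitionFieldMap))
       .build();
   ```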
   
   But I would expect the Iceberg API to take care of this by itself.
   
   ### Willingness to contribute
   
   - [ ] I can contribute a fix for this bug independently
   - [X] I would be willing to contribute a fix for this bug with guidance from the Iceberg community
   - [ ] I cannot contribute a fix for this bug at this time

