wombatu-kun opened a new pull request, #16738:
URL: https://github.com/apache/iceberg/pull/16738

   Follow-up to #16735, which replaced the per-node `String.format` recursion 
in `HiveSchemaUtil.convertToTypeString` with a single `StringBuilder`. This 
applies the same optimization to the analogous Glue path, 
`IcebergToGlueConverter.toTypeString`, which runs on every Glue create/update 
commit (via `setTableInputInformation` -> `toColumns`, which also re-converts 
every schema in the table's schema history).
   
   The previous code allocated at every level of the type tree: a 
`String.format` per node (format-template parse, varargs array, autoboxing of 
decimal precision/scale, locale lookup), a `Stream`/`Collector`/`StringJoiner` 
per struct, and it re-copied each child type string into its parent 
(superlinear for deeply nested types). The rewrite threads a single 
`StringBuilder` through the recursion and appends directly, so each character 
is written once. The produced type strings are byte-identical, and 
`toTypeString`'s signature and all call sites are unchanged.
   
   This is a performance and readability change only, with no behavior change: 
Glue's `default` case already rendered VARIANT and other types without error, 
so there is no correctness component here.
   
   ## Benchmark
   
   A JMH benchmark over three schema shapes (avg time and `gc.alloc.rate.norm`, 
JDK 21, 1 fork, 5+5 iterations, `-prof gc`). The benchmark is not included in 
this PR.
   
   | Schema shape | avg time us/op (before -> after) | alloc B/op (before -> 
after) |
   |---|---|---|
   | `WIDE_FLAT` (200 primitive cols, control) | 52.1 -> 48.3 (-7%) | 141,384 
-> 138,984 (-2%) |
   | `DEEPLY_NESTED` (~10-deep struct/list/map) | 6.1 -> 1.1 (-82%) | 21,008 -> 
2,680 (-87%) |
   | `WIDE_NESTED_MIX` (100 mixed cols) | 68.5 -> 28.4 (-59%) | 127,256 -> 
76,968 (-40%) |
   
   `WIDE_FLAT` is a control dominated by per-column `Column`/`ImmutableMap` 
building that this change does not touch, so it stays roughly flat; the win 
grows with nesting depth and the number of nested columns. Allocation figures 
are deterministic; the `WIDE_NESTED_MIX` baseline time had high run-to-run 
variance, so its allocation reduction is the reliable measure there. Existing 
`TestIcebergToGlueConverter` continues to pass, pinning the byte-identical 
output.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to