wombatu-kun opened a new pull request, #16738: URL: https://github.com/apache/iceberg/pull/16738
Follow-up to #16735, which replaced the per-node `String.format` recursion in `HiveSchemaUtil.convertToTypeString` with a single `StringBuilder`. This applies the same optimization to the analogous Glue path, `IcebergToGlueConverter.toTypeString`, which runs on every Glue create/update commit (via `setTableInputInformation` -> `toColumns`, which also re-converts every schema in the table's schema history). The previous code allocated at every level of the type tree: a `String.format` per node (format-template parse, varargs array, autoboxing of decimal precision/scale, locale lookup), a `Stream`/`Collector`/`StringJoiner` per struct, and it re-copied each child type string into its parent (superlinear for deeply nested types). The rewrite threads a single `StringBuilder` through the recursion and appends directly, so each character is written once. The produced type strings are byte-identical, and `toTypeString`'s signature and all call sites are unchanged. This is a performance and readability change only, with no behavior change: Glue's `default` case already rendered VARIANT and other types without error, so there is no correctness component here. ## Benchmark A JMH benchmark over three schema shapes (avg time and `gc.alloc.rate.norm`, JDK 21, 1 fork, 5+5 iterations, `-prof gc`). The benchmark is not included in this PR. | Schema shape | avg time us/op (before -> after) | alloc B/op (before -> after) | |---|---|---| | `WIDE_FLAT` (200 primitive cols, control) | 52.1 -> 48.3 (-7%) | 141,384 -> 138,984 (-2%) | | `DEEPLY_NESTED` (~10-deep struct/list/map) | 6.1 -> 1.1 (-82%) | 21,008 -> 2,680 (-87%) | | `WIDE_NESTED_MIX` (100 mixed cols) | 68.5 -> 28.4 (-59%) | 127,256 -> 76,968 (-40%) | `WIDE_FLAT` is a control dominated by per-column `Column`/`ImmutableMap` building that this change does not touch, so it stays roughly flat; the win grows with nesting depth and the number of nested columns. Allocation figures are deterministic; the `WIDE_NESTED_MIX` baseline time had high run-to-run variance, so its allocation reduction is the reliable measure there. Existing `TestIcebergToGlueConverter` continues to pass, pinning the byte-identical output. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
