wombatu-kun opened a new pull request, #16349:
URL: https://github.com/apache/iceberg/pull/16349

   ## What & why
   
   **This implements the existing `// TODO: direct conversion from string to 
byte buffer` in `SparkValueWriters`.** The Spark data layer converted every 
UUID value through an intermediate `String` and a `java.util.UUID` object, in 
both directions: the write path called `UUID.fromString(s.toString())` and 
then re-serialized the two longs to 16 bytes; the read path called 
`UUIDUtil.convert(buf).toString()` and wrapped the result as a `UTF8String`. 
On write, the UUID arrives as the ASCII bytes of its canonical string and 
leaves as 16 raw bytes; on read, the reverse. The `String` and `UUID` objects 
are therefore pure per-row allocation overhead.
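
For illustration, the conversion path described above can be sketched with the JDK alone. This is not the code being replaced: the class name `OldUuidPathSketch` is hypothetical, and a plain ASCII `byte[]` stands in for Spark's `UTF8String` so the sketch has no dependencies.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.UUID;

// Hypothetical, dependency-free illustration of the String/UUID round trip.
final class OldUuidPathSketch {
  private OldUuidPathSketch() {}

  // Write side: ASCII bytes -> String -> UUID -> 16 big-endian bytes.
  static ByteBuffer write(byte[] asciiUuid) {
    String s = new String(asciiUuid, StandardCharsets.US_ASCII); // per-row String
    UUID uuid = UUID.fromString(s);                              // per-row UUID
    ByteBuffer buf = ByteBuffer.allocate(16);
    buf.putLong(0, uuid.getMostSignificantBits());
    buf.putLong(8, uuid.getLeastSignificantBits());
    return buf;
  }

  // Read side: 16 bytes -> UUID -> String -> ASCII bytes.
  static byte[] read(ByteBuffer buf) {
    UUID uuid = new UUID(buf.getLong(0), buf.getLong(8));        // per-row UUID
    return uuid.toString().getBytes(StandardCharsets.US_ASCII);  // per-row String
  }

  public static void main(String[] args) {
    byte[] ascii = UUID.randomUUID().toString().getBytes(StandardCharsets.US_ASCII);
    byte[] back = read(write(ascii));
    System.out.println(new String(back, StandardCharsets.US_ASCII));
  }
}
```

Both directions create a `String` and a `UUID` per value, which is the allocation cost the description refers to.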
   
   ## Changes
   
   - Add `UUIDUtil.convertToByteBuffer(byte[] uuidStringBytes, ByteBuffer 
reuse)` — parses the 36 ASCII bytes of a canonical UUID string directly into 
the 16-byte big-endian form.
   - Add `UUIDUtil.convertToStringBytes(ByteBuffer uuidBytes, byte[] reuse)` — 
renders 16 bytes back to the 36 ASCII bytes of the canonical string.
   - Rewire all UUID read/write sites in the Avro/Parquet/ORC 
`Spark*Readers`/`Spark*Writers` for Spark 3.4, 3.5, 4.0, and 4.1 to use these 
helpers.
   - Add `TestUUIDUtil` coverage for the new methods.
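
A minimal sketch of what the two helpers might look like, assuming a strictly canonical 36-byte input and a big-endian buffer. The method names and parameter lists mirror the signatures listed above, but the class `UuidBytesSketch`, the private helpers, the validation, and the treatment of a null `reuse` argument are assumptions made for illustration, not the PR's implementation.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.UUID;

// Hypothetical stand-in for the two new UUIDUtil helpers; not the PR's code.
final class UuidBytesSketch {
  private static final byte[] HEX = "0123456789abcdef".getBytes(StandardCharsets.US_ASCII);

  private UuidBytesSketch() {}

  // Decodes the hex digits in ascii[start, end) into the low bits of a long.
  private static long parseHex(byte[] ascii, int start, int end) {
    long value = 0L;
    for (int i = start; i < end; i++) {
      int c = ascii[i];
      int digit;
      if (c >= '0' && c <= '9') {
        digit = c - '0';
      } else if (c >= 'a' && c <= 'f') {
        digit = c - 'a' + 10;
      } else if (c >= 'A' && c <= 'F') {
        digit = c - 'A' + 10;
      } else {
        throw new IllegalArgumentException("Invalid hex digit at index " + i);
      }
      value = (value << 4) | digit;
    }
    return value;
  }

  // Writes the low `digits` nibbles of value as lowercase hex starting at out[offset].
  private static void writeHex(long value, int digits, byte[] out, int offset) {
    long rest = value;
    for (int i = digits - 1; i >= 0; i--) {
      out[offset + i] = HEX[(int) (rest & 0xF)];
      rest >>>= 4;
    }
  }

  // 36 ASCII bytes -> 16 big-endian bytes. A null `reuse` allocates (assumed behavior).
  static ByteBuffer convertToByteBuffer(byte[] uuidStringBytes, ByteBuffer reuse) {
    byte[] s = uuidStringBytes;
    if (s.length != 36 || s[8] != '-' || s[13] != '-' || s[18] != '-' || s[23] != '-') {
      throw new IllegalArgumentException("Not a canonical 36-byte UUID string");
    }
    long msb = (((parseHex(s, 0, 8) << 16) | parseHex(s, 9, 13)) << 16) | parseHex(s, 14, 18);
    long lsb = (parseHex(s, 19, 23) << 48) | parseHex(s, 24, 36);
    ByteBuffer buf = reuse != null ? reuse : ByteBuffer.allocate(16);
    buf.putLong(0, msb); // absolute puts leave position and limit untouched
    buf.putLong(8, lsb);
    return buf;
  }

  // 16 big-endian bytes -> 36 ASCII bytes. A null `reuse` allocates (assumed behavior).
  static byte[] convertToStringBytes(ByteBuffer uuidBytes, byte[] reuse) {
    long msb = uuidBytes.getLong(0);
    long lsb = uuidBytes.getLong(8);
    byte[] out = reuse != null ? reuse : new byte[36];
    writeHex(msb >>> 32, 8, out, 0);
    out[8] = '-';
    writeHex(msb >>> 16, 4, out, 9);
    out[13] = '-';
    writeHex(msb, 4, out, 14);
    out[18] = '-';
    writeHex(lsb >>> 48, 4, out, 19);
    out[23] = '-';
    writeHex(lsb, 12, out, 24);
    return out;
  }

  public static void main(String[] args) {
    UUID uuid = UUID.randomUUID();
    byte[] ascii = uuid.toString().getBytes(StandardCharsets.US_ASCII);
    ByteBuffer buf = convertToByteBuffer(ascii, null);
    String back = new String(convertToStringBytes(buf, null), StandardCharsets.US_ASCII);
    System.out.println(uuid + " -> " + back);
  }
}
```

Unlike `UUID.fromString`, this sketch rejects anything that is not exactly 36 bytes with dashes at positions 8, 13, 18, and 23. Whether the real helper is that strict is not stated in the description.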
   
   ## Correctness
   
   Both helpers pivot on the `(mostSigBits, leastSigBits)` long pair: the 
parser reproduces `java.util.UUID.fromString` (parse `[0,8)` → `<<16 | [9,13)` 
→ `<<16 | [14,18)` for MSB; `[19,23)` → `<<48 | [24,36)` for LSB) and then 
`putLong(0, msb); putLong(8, lsb)` exactly as the previous 
`convertToByteBuffer(UUID, reuse)`; the formatter is the inverse and matches 
`UUID.toString()`. The output is therefore byte-for-byte identical to the 
previous code. The write side keeps the reusable thread-local buffer; the read 
side must allocate a fresh array because `UTF8String.fromBytes` wraps without 
copying (a reused buffer would alias across rows).
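
The aliasing hazard on the read side can be reproduced without Spark. In the sketch below, `ByteBuffer.wrap`, which also wraps its array without copying, stands in for `UTF8String.fromBytes`; the class `AliasingSketch` and its `readRows` method are hypothetical names used only for this illustration.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo: wrapping a reused array makes earlier rows change underneath you.
final class AliasingSketch {
  private AliasingSketch() {}

  // "Reads" each 36-character value into a byte[36], wraps it without copying,
  // and returns what every wrapped row contains once all rows have been read.
  static List<String> readRows(String[] values, boolean reuseArray) {
    byte[] shared = new byte[36];
    List<ByteBuffer> rows = new ArrayList<>();
    for (String value : values) {
      byte[] out = reuseArray ? shared : new byte[36];
      byte[] src = value.getBytes(StandardCharsets.US_ASCII);
      System.arraycopy(src, 0, out, 0, 36);
      rows.add(ByteBuffer.wrap(out)); // wraps the array, no copy
    }
    List<String> seen = new ArrayList<>();
    for (ByteBuffer row : rows) {
      seen.add(new String(row.array(), StandardCharsets.US_ASCII));
    }
    return seen;
  }

  public static void main(String[] args) {
    String[] values = {
      "00000000-0000-0000-0000-000000000001",
      "00000000-0000-0000-0000-000000000002"
    };
    System.out.println("reused array: " + readRows(values, true));
    System.out.println("fresh arrays: " + readRows(values, false));
  }
}
```

With a reused array every wrapped row ends up showing the last value written, which is why a fresh array per value is needed when the wrapper does not copy.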
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

