goel-skd opened a new issue, #808:
URL: https://github.com/apache/iceberg-cpp/issues/808

   ### Context
   
   Issue #613 asks for case-insensitive field matching consistent with 
iceberg-java and iceberg-python (both Unicode-aware), with `İ` (U+0130) as the 
example. PR #760 (part of #613) made `StringUtils::ToLower` Unicode-aware using 
utf8proc **simple (1:1)** case mapping and added allocation-free ASCII fast 
paths.
   
   This issue captures the **design and remaining plan** to reach full parity 
and tracks the follow-up PRs.
   
   ### Remaining gap
   
   utf8proc's simple mapping still diverges from Java for the few code points 
whose simple and full lowercase mappings differ, chiefly `İ`:
   
   | input | iceberg-cpp (simple) | iceberg-java `toLowerCase(Locale.ROOT)` / Python `str.lower()` |
   |---|---|---|
   | `İ` (U+0130) | `i` (U+0069) | `i̇` (U+0069 U+0307) |
   
   So `EqualsIgnoreCase("İD", "id")` is **true** in iceberg-cpp but **false** 
in Java and Python, where `İD` lowercases to `i̇d` (U+0069 U+0307 U+0064). 
This is the inconsistency #613 describes.
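
To make the divergence concrete, here is a minimal C++ sketch of the two behaviours. `ToLowerSketch` and `CaseMapping` are hypothetical names, not part of iceberg-cpp or utf8proc, and the helper only understands ASCII and `İ` (U+0130); every other byte is copied through unchanged. It is meant to illustrate the byte-level difference, not to propose an implementation.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical illustration only: selects which mapping to apply to U+0130.
enum class CaseMapping { kSimple, kFull };

// Lowercases UTF-8 input, handling only ASCII letters and U+0130.
// All other bytes are copied through unchanged (assumption for brevity).
std::string ToLowerSketch(std::string_view in, CaseMapping mode) {
  std::string out;
  out.reserve(in.size() + 2);
  for (std::size_t i = 0; i < in.size(); ++i) {
    const unsigned char c = static_cast<unsigned char>(in[i]);
    // U+0130 is encoded in UTF-8 as the two bytes C4 B0.
    if (c == 0xC4 && i + 1 < in.size() &&
        static_cast<unsigned char>(in[i + 1]) == 0xB0) {
      out.push_back('i');  // U+0069
      if (mode == CaseMapping::kFull) {
        out.append("\xCC\x87");  // U+0307 COMBINING DOT ABOVE
      }
      ++i;  // skip the continuation byte
    } else if (c >= 'A' && c <= 'Z') {
      out.push_back(static_cast<char>(c + ('a' - 'A')));
    } else {
      out.push_back(static_cast<char>(c));
    }
  }
  return out;
}
```

With the simple mapping, `ToLowerSketch("\xC4\xB0" "D", CaseMapping::kSimple)` yields the two bytes `id`, so it compares equal to `"id"`. With the full mapping the result is four bytes (`i`, `CC 87`, `d`), so the comparison fails, matching the Java and Python column of the table.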
   
   ### Design questions
   
   - Match iceberg-java `toLowerCase(Locale.ROOT)` exactly; confirm the 
operation PyIceberg uses for matching and that it agrees.
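   - Full lowercase mapping vs. Unicode case folding: utf8proc offers full case 
folding (`utf8proc_map` + `UTF8PROC_CASEFOLD`); verify it reproduces the 
Java/Python result, or add a small explicit mapping. Folding is not the same 
operation as lowercasing (for example, `ß` folds to `ss` but lowercases to 
itself), so this needs checking beyond `İ`.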
   - Keep the ASCII fast path; stream the non-ASCII path rather than 
materializing lowercased copies of both strings.
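
As a rough sketch of how the last point could fit together with full mapping, the following C++ compares an ASCII prefix byte by byte and then streams lowercased bytes from both inputs without building temporary strings. All names here (`AsciiLower`, `LowerByteStream`, `EqualsIgnoreCaseSketch`) are hypothetical, and the stream only special-cases `İ` (U+0130); a real version would presumably take its mappings from utf8proc or an explicit table. The small pending buffer is there because a full mapping can expand one code point into several.

```cpp
#include <algorithm>
#include <cstddef>
#include <string_view>

// Lowercases ASCII letters; every other byte is returned unchanged.
inline unsigned char AsciiLower(unsigned char c) {
  return (c >= 'A' && c <= 'Z') ? static_cast<unsigned char>(c + ('a' - 'A')) : c;
}

// Hypothetical streaming lowercaser: yields lowercased UTF-8 bytes one at a
// time. Only ASCII and U+0130 are mapped (assumption for brevity).
class LowerByteStream {
 public:
  explicit LowerByteStream(std::string_view s) : s_(s) {}

  // Returns the next lowercased byte, or -1 at end of input.
  int Next() {
    if (pending_pos_ < pending_len_) return pending_[pending_pos_++];
    if (pos_ >= s_.size()) return -1;
    const unsigned char c = static_cast<unsigned char>(s_[pos_]);
    // U+0130 (C4 B0) expands to 'i' followed by U+0307 (CC 87).
    if (c == 0xC4 && pos_ + 1 < s_.size() &&
        static_cast<unsigned char>(s_[pos_ + 1]) == 0xB0) {
      pos_ += 2;
      pending_[0] = 0xCC;
      pending_[1] = 0x87;
      pending_pos_ = 0;
      pending_len_ = 2;
      return 'i';
    }
    ++pos_;
    return AsciiLower(c);
  }

 private:
  std::string_view s_;
  std::size_t pos_ = 0;
  unsigned char pending_[2] = {0, 0};  // bytes left over from an expansion
  int pending_pos_ = 0;
  int pending_len_ = 0;
};

bool EqualsIgnoreCaseSketch(std::string_view a, std::string_view b) {
  // ASCII fast path: compare in place until either side is non-ASCII.
  const std::size_t n = std::min(a.size(), b.size());
  std::size_t i = 0;
  while (i < n) {
    const unsigned char ca = static_cast<unsigned char>(a[i]);
    const unsigned char cb = static_cast<unsigned char>(b[i]);
    if (ca >= 0x80 || cb >= 0x80) break;
    if (AsciiLower(ca) != AsciiLower(cb)) return false;
    ++i;
  }
  if (i == a.size() && i == b.size()) return true;

  // Streaming path: compare lowercased bytes without allocating.
  LowerByteStream sa(a.substr(i));
  LowerByteStream sb(b.substr(i));
  for (;;) {
    const int x = sa.Next();
    const int y = sb.Next();
    if (x != y) return false;
    if (x == -1) return true;
  }
}
```

Under these assumptions, `EqualsIgnoreCaseSketch("\xC4\xB0" "D", "id")` returns false, while comparing the same input against `"i\xCC\x87" "d"` returns true, which is the behaviour the Java-parity goal asks for.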
   
   ### Work Items
   
   - [ ] Full case mapping to close the `İ` / Java-parity gap
   - [ ] Streaming / allocation-free non-ASCII comparison in `EqualsIgnoreCase` 
/ `StartsWithIgnoreCase` (deferred from #760)
   
   ### References
   
   - Issue: #613 (origin)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

