goel-skd opened a new issue, #808:
URL: https://github.com/apache/iceberg-cpp/issues/808
### Context
Issue #613 asks for case-insensitive field matching consistent with
iceberg-java and iceberg-python (both Unicode-aware), with `İ` (U+0130) as the
example. PR #760 (Part of #613) made `StringUtils::ToLower` Unicode-aware using
utf8proc **simple (1:1)** case mapping and added allocation-free ASCII fast
paths.
This issue captures the **design and remaining plan** to reach full parity
and tracks the follow-up PRs.
### Remaining gap
utf8proc's simple mapping still diverges from Java for the few code points
where the simple and full case mappings differ, chiefly `İ`:
| input | iceberg-cpp (simple) | iceberg-java `toLowerCase(Locale.ROOT)` / Python `str.lower()` |
|---|---|---|
| `İ` (U+0130) | `i` (U+0069) | `i̇` (U+0069 U+0307) |
So `EqualsIgnoreCase("İD", "id")` is **true** in iceberg-cpp but **false**
in Java and Python, which is the inconsistency #613 describes.
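
To make the divergence concrete, here is a minimal sketch that contrasts a 1:1 mapping with a full mapping for `İ`. The helpers (`LowerSimple`, `ToLowerSimple`, `ToLowerFull`, `EqualsIgnoreCaseSimple`, `EqualsIgnoreCaseFull`) are hypothetical names invented for illustration. They handle only ASCII and U+0130, and they are not the iceberg-cpp or utf8proc API.

```cpp
#include <string>
#include <string_view>

// Hypothetical helpers for illustration only: they cover ASCII and U+0130.

// Simple (1:1) lowercase mapping: one code point in, one code point out.
char32_t LowerSimple(char32_t cp) {
  if (cp >= U'A' && cp <= U'Z') {
    return static_cast<char32_t>(cp + (U'a' - U'A'));
  }
  if (cp == U'\u0130') return U'i';  // simple mapping: U+0130 -> U+0069
  return cp;
}

std::u32string ToLowerSimple(std::u32string_view s) {
  std::u32string out;
  out.reserve(s.size());
  for (char32_t cp : s) out.push_back(LowerSimple(cp));
  return out;
}

// Full lowercase mapping: one code point may expand to several.
std::u32string ToLowerFull(std::u32string_view s) {
  std::u32string out;
  for (char32_t cp : s) {
    if (cp == U'\u0130') {
      out.append(U"i\u0307");  // full mapping: U+0130 -> U+0069 U+0307
    } else {
      out.push_back(LowerSimple(cp));
    }
  }
  return out;
}

bool EqualsIgnoreCaseSimple(std::u32string_view a, std::u32string_view b) {
  return ToLowerSimple(a) == ToLowerSimple(b);
}

bool EqualsIgnoreCaseFull(std::u32string_view a, std::u32string_view b) {
  return ToLowerFull(a) == ToLowerFull(b);
}
```

With these definitions, `EqualsIgnoreCaseSimple(U"\u0130" U"D", U"id")` is true, while `EqualsIgnoreCaseFull` on the same inputs is false because the lowered left side keeps the combining dot above (U+0307).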
### Design questions
- Match iceberg-java `toLowerCase(Locale.ROOT)` exactly; confirm the
operation PyIceberg uses for matching and that it agrees.
- Full lowercase mapping vs. Unicode case folding: utf8proc offers full case
folding (`utf8proc_map` + `UTF8PROC_CASEFOLD`); verify that it reproduces the
Java/Python result (folding and lowercasing can differ, e.g. `ß` folds to
`ss` but lowercases to itself), or add a small explicit mapping.
- Keep the ASCII fast path; stream the non-ASCII path rather than
materializing lowered copies of both strings.
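
One possible shape for the last two points, shown as a rough sketch and not as the planned implementation: compare bytes directly while both inputs are ASCII, then fall back to a cursor that decodes and lowers one code point at a time. All names here (`DecodeOne`, `AsciiLower`, `LoweredCursor`) are hypothetical. The sketch assumes well-formed UTF-8, lowers only ASCII plus an explicit U+0130 entry, and leaves every other code point unchanged where real code would presumably consult a Unicode library such as utf8proc.

```cpp
#include <algorithm>
#include <cstddef>
#include <string_view>

// Decodes one code point at s[pos] and advances pos.
// Assumes well-formed UTF-8; truncated input yields U+FFFD.
char32_t DecodeOne(std::string_view s, std::size_t& pos) {
  const auto b0 = static_cast<unsigned char>(s[pos]);
  std::size_t len = 1;
  char32_t cp = b0;
  if (b0 >= 0xF0) { len = 4; cp = b0 & 0x07; }
  else if (b0 >= 0xE0) { len = 3; cp = b0 & 0x0F; }
  else if (b0 >= 0xC0) { len = 2; cp = b0 & 0x1F; }
  if (pos + len > s.size()) { pos = s.size(); return 0xFFFD; }
  for (std::size_t i = 1; i < len; ++i) {
    cp = (cp << 6) | (static_cast<unsigned char>(s[pos + i]) & 0x3F);
  }
  pos += len;
  return cp;
}

unsigned char AsciiLower(unsigned char c) {
  return (c >= 'A' && c <= 'Z') ? static_cast<unsigned char>(c + 32) : c;
}

// Yields lowered code points one at a time. The only expansion handled is
// the explicit U+0130 -> U+0069 U+0307 entry, so one pending slot suffices.
class LoweredCursor {
 public:
  explicit LoweredCursor(std::string_view s) : s_(s) {}
  bool Done() const { return !has_pending_ && pos_ >= s_.size(); }
  char32_t Next() {
    if (has_pending_) { has_pending_ = false; return pending_; }
    const char32_t cp = DecodeOne(s_, pos_);
    if (cp == 0x0130) { pending_ = 0x0307; has_pending_ = true; return U'i'; }
    // Stand-in: only ASCII is lowered here; other code points pass through.
    return cp < 0x80 ? AsciiLower(static_cast<unsigned char>(cp)) : cp;
  }

 private:
  std::string_view s_;
  std::size_t pos_ = 0;
  char32_t pending_ = 0;
  bool has_pending_ = false;
};

bool EqualsIgnoreCase(std::string_view a, std::string_view b) {
  // ASCII fast path: no decoding and no allocation.
  std::size_t i = 0;
  const std::size_t n = std::min(a.size(), b.size());
  while (i < n) {
    const auto ca = static_cast<unsigned char>(a[i]);
    const auto cb = static_cast<unsigned char>(b[i]);
    if ((ca | cb) >= 0x80) break;  // non-ASCII byte: switch to slow path
    if (AsciiLower(ca) != AsciiLower(cb)) return false;
    ++i;
  }
  if (i == a.size() && i == b.size()) return true;
  // Streaming slow path over the remaining suffixes.
  LoweredCursor x(a.substr(i));
  LoweredCursor y(b.substr(i));
  while (!x.Done() && !y.Done()) {
    if (x.Next() != y.Next()) return false;
  }
  return x.Done() && y.Done();
}
```

Under these assumptions, `EqualsIgnoreCase("\xC4\xB0" "D", "id")` is false and `EqualsIgnoreCase("\xC4\xB0" "D", "i\xCC\x87" "d")` is true, without building a lowered copy of either string. The string literals are split because a hex escape would otherwise absorb the following `D` or `d`.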
### Work Items
- [ ] Full case mapping to close the `İ` / java-parity gap
- [ ] Streaming / allocation-free non-ASCII comparison in `EqualsIgnoreCase`
/ `StartsWithIgnoreCase` (deferred from #760)
### References
- Issue: #613 (origin)
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]