RFD: UnicodeCharacter – Object representation of Unicode code points

Pavel Ponec Thu, 03 Jul 2025 08:09:39 -0700

Hi all,

I'd like to propose the addition of a new object-oriented abstraction for
representing full Unicode characters in Java: `UnicodeCharacter`.


This class addresses a fundamental limitation of the current `Character`
type, which wraps a single `char` and therefore cannot properly represent
Unicode characters outside the Basic Multilingual Plane (BMP). With the
growing importance of supplementary characters (e.g., emoji, historic
scripts, rare CJK ideographs), a more complete and object-oriented Unicode
abstraction would be beneficial to the JDK.

### Motivation

The `Character` type is limited to 16-bit `char` units, and cannot
represent characters requiring surrogate pairs (code points > U+FFFF). Java
developers working with text must often deal with `codePointAt`, `toChars`,
and `offsetByCodePoints`, resulting in fragile and error-prone code.
Furthermore, there's no immutable object type that cleanly encapsulates a
single logical Unicode character.
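
To make the fragility concrete, here is a minimal sketch using only existing
JDK methods. The class name `SurrogateDemo` and its helper methods are
hypothetical and exist only for this illustration. It shows how `char`-based
code miscounts and corrupts a string containing U+1F600, while the
code-point-based variants behave as expected.

```java
public class SurrogateDemo {

    /** Number of UTF-16 code units, which is what String.length() reports. */
    public static int charUnits(String s) {
        return s.length();
    }

    /** Number of logical Unicode code points. */
    public static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    /** Naive char-by-char reversal; it swaps the halves of surrogate pairs. */
    public static String naiveReverse(String s) {
        char[] out = new char[s.length()];
        for (int i = 0; i < s.length(); i++) {
            out[i] = s.charAt(s.length() - 1 - i);
        }
        return new String(out);
    }

    /** Reversal over code points; surrogate pairs stay intact. */
    public static String codePointReverse(String s) {
        int[] cps = s.codePoints().toArray();
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = cps.length - 1; i >= 0; i--) {
            sb.appendCodePoint(cps[i]);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // 'a', U+1F600 (encoded as a surrogate pair), 'b'
        String text = "a\uD83D\uDE00b";
        System.out.println(charUnits(text));   // 4 code units
        System.out.println(codePoints(text));  // 3 code points
        // The naive result contains a low surrogate followed by a high one.
        System.out.println(naiveReverse(text).equals(codePointReverse(text))); // false
    }
}
```

The loop in `naiveReverse` is hand-written on purpose: `StringBuilder.reverse()`
already treats surrogate pairs specially, which is the kind of detail callers
currently have to know.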

### Proposed Class: UnicodeCharacter

This proposal introduces a final, immutable class that wraps a valid
Unicode code point and exposes convenient methods to work with it.

A reference implementation is available here:
https://github.com/pponec/ujorm/blob/master/project-m2/ujo-tools/src/main/java/org/ujorm/tools/common/UnicodeCharacter.java

Highlights:

```java
public final class UnicodeCharacter implements CharSequence, Comparable<UnicodeCharacter> {

    public static UnicodeCharacter of(final int codePoint);
    public static UnicodeCharacter of(final CharSequence text, final int unicodeIndex);

    public int codePoint();
    public char[] toChars();
    public int charCount();
    public boolean equals(char c);

    @Override
    public String toString();
    @Override
    public int length();
    @Override
    public char charAt(int index);
    @Override
    public CharSequence subSequence(int start, int end);
}
```

### Benefits

- Proper support for the full Unicode range, including supplementary
  characters.
- Immutable and type-safe object model.
- Simpler and safer text iteration and processing.
- Aligns well with modern Java idioms, e.g. `Stream<UnicodeCharacter>` from a
  `String`.
- Object-oriented alternative to repeated `Character.toChars(...)`,
  `codePointAt(...)`, etc.
### Compatibility
The proposed class is entirely new and doesn't break any existing APIs. It
complements existing types and uses only standard Java APIs. It can be
introduced in the java.lang or java.text package without VM-level changes.

### Adoption
This type can be used by libraries, UI frameworks, editors, and any
text-processing tools where proper Unicode character semantics are
critical. It promotes correctness in multilingual and emoji-rich
applications.

Please let me know if there's interest. I'm happy to further develop this
idea into a JEP if the community agrees it's worth exploring.

Best regards,
Pavel Ponec

ppo...@gmail.com
