The current data serializer supports 4 different encodings (in addition to
the 3 or 4 other encodings I have found elsewhere in Geode). It supports
ASCII prefixed with a uint16 length (64K), ASCII prefixed with a uint32
length, Java modified UTF-8 with a uint16 length (64K) and finally UTF-16
with a uint32 length (count of UTF-16 code units, not bytes). When
serializing a string it is first scanned for any non-ASCII characters and
the number of bytes a modified UTF-8 encoding would require is calculated.
If the encoded length is equal to the original length then the string is
ASCII: if the length is less than 2^16 the 16-bit length version is used,
otherwise the 32-bit length version. If the encoded length is greater than
the original but less than 2^16 then modified UTF-8 is used with a 16-bit
length, otherwise UTF-16 with a 32-bit length is used.
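
As a rough sketch of the selection logic described above (the class, enum
and method names here are made up for illustration and are not the actual
Geode code):

```java
// Illustrative sketch only; names are hypothetical, not Geode's API.
final class StringEncodingChooser {
    enum Encoding { ASCII_UINT16, ASCII_UINT32, MODIFIED_UTF8_UINT16, UTF16_UINT32 }

    /** Number of bytes s would occupy in Java's modified UTF-8. */
    static long modifiedUtf8Length(String s) {
        long bytes = 0;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c >= 0x0001 && c <= 0x007F) {
                bytes += 1;          // plain ASCII, except NUL
            } else if (c <= 0x07FF) {
                bytes += 2;          // includes U+0000, which takes two bytes
            } else {
                bytes += 3;          // each surrogate is encoded on its own
            }
        }
        return bytes;
    }

    /** Picks an encoding following the rules described in the text. */
    static Encoding choose(String s) {
        long encoded = modifiedUtf8Length(s);
        boolean fitsIn16Bits = encoded < (1L << 16);
        if (encoded == s.length()) {
            // No character needed more than one byte, so treat it as ASCII.
            return fitsIn16Bits ? Encoding.ASCII_UINT16 : Encoding.ASCII_UINT32;
        }
        return fitsIn16Bits ? Encoding.MODIFIED_UTF8_UINT16 : Encoding.UTF16_UINT32;
    }
}
```

Note that under this rule a string containing U+0000 is not considered
ASCII, since the NUL grows to two bytes in modified UTF-8.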

When working with non-Java clients, dealing with modified UTF-8 requires
conversion between it and standard UTF-8, as modified UTF-8 encodes the NULL
character in two bytes (and supplementary characters as surrogate pairs of
three bytes each). Only Java's DataInput/Output and JNI use this encoding,
and it was intended for internal use and Java serialization only. The
StreamReader/Writer use standard UTF-8. Since our serialization expects
modified UTF-8 strings to be prefixed with a 16-bit length, care has to be
taken to calculate the modified length up front (or seek back in the buffer
and rewrite the length if nulls are encountered). Since the modified length
may differ from the standard length, care must then be taken to make sure
that, if the string must be truncated to fit in the 16-bit limit, it is not
cut in the middle of a multibyte sequence.
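
To illustrate the kind of care a client has to take, here is a minimal
sketch that finds how many leading UTF-16 code units of a string fit in a
given modified UTF-8 byte budget without splitting a character. The class
and method names are hypothetical, and a real non-Java client would have to
do the equivalent work starting from its own string representation:

```java
// Illustrative sketch only; names are hypothetical.
final class ModifiedUtf8Truncation {
    /** Largest payload a uint16 length prefix can describe. */
    static final int MAX_BYTES = 0xFFFF;

    /** Bytes one UTF-16 code unit occupies in modified UTF-8. */
    static int modifiedUtf8Size(char c) {
        if (c >= 0x0001 && c <= 0x007F) return 1;
        if (c <= 0x07FF) return 2;   // includes U+0000
        return 3;                    // includes lone or paired surrogates
    }

    /**
     * Returns the number of leading code units of s whose modified UTF-8
     * encoding fits in maxBytes, never splitting a surrogate pair.
     */
    static int fittingPrefixLength(String s, int maxBytes) {
        int bytes = 0;
        int i = 0;
        while (i < s.length()) {
            char c = s.charAt(i);
            int units = 1;
            int size = modifiedUtf8Size(c);
            if (Character.isHighSurrogate(c) && i + 1 < s.length()
                    && Character.isLowSurrogate(s.charAt(i + 1))) {
                units = 2;           // keep the pair together
                size = 6;            // two 3-byte sequences
            }
            if (bytes + size > maxBytes) break;
            bytes += size;
            i += units;
        }
        return i;
    }
}
```

Calling fittingPrefixLength(s, ModifiedUtf8Truncation.MAX_BYTES) and then
s.substring(0, n) gives a string that is safe to write with the 16-bit
length prefix.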

Encoding in UTF-16 isn't all that bad except that it is mostly wasted space
when strings are ASCII. There are no real encoding issues between languages
since most support it or use it as their internal representation, like Java
does. But we are talking about serialization here and typically space is
the constraint. Most Latin characters are low enough in the Basic
Multilingual Plane to be encoded in 2 bytes of UTF-8 and take up no more
space than the UTF-16 encoded version. Other BMP characters (U+0800 and
above) take 3 bytes in UTF-8, one more than in UTF-16.
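
For a quick feel for the sizes involved, a small sketch using only the
standard JDK charsets (the helper class name is made up; UTF_16BE is used
so that no byte order mark is counted):

```java
import java.nio.charset.StandardCharsets;

// Illustrative helper; the class name is hypothetical.
final class EncodedSizes {
    /** Bytes needed for s in standard UTF-8. */
    static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    /** Bytes needed for s in UTF-16, without a byte order mark. */
    static int utf16Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_16BE).length;
    }
}
```

With this, "a" is 1 byte in UTF-8 and 2 in UTF-16, "\u00E9" (e with acute
accent) is 2 bytes in both, and "\u20AC" (the euro sign) is 3 bytes in
UTF-8 but 2 in UTF-16.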

Since we took the care to optimize ASCII, one can assume we figured out that
ASCII was our most common character set. Regardless of the correctness of
this assumption, it makes no sense to treat ASCII and UTF-8 streams
differently, as ASCII can be fully encoded byte for byte in UTF-8 without
any overhead.

So what I would like to propose is that we deprecate all these methods and
replace them with standard UTF-8 prefixed with a uint64 length. It is
preferable that the length be variable-length encoded to reduce the
overhead of encoding small strings. Why such a large length? Consider that
different languages have different limits, and that Java stores strings
internally as UTF-16.
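
One way to keep the length prefix small for short strings is a
variable-length integer with 7 payload bits per byte and a continuation
bit, as used by several existing wire formats. The sketch below assumes
that layout; it is not a wire format anyone has agreed on for Geode, and
the class and method names are hypothetical:

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Illustrative sketch only; the wire layout is an assumption.
final class VarintUtf8 {
    /** Writes value as an unsigned varint, low 7 bits first. */
    static void writeVarint(ByteArrayOutputStream out, long value) {
        while ((value & ~0x7FL) != 0) {
            out.write((int) ((value & 0x7F) | 0x80)); // more bytes follow
            value >>>= 7;
        }
        out.write((int) value);
    }

    /** Reads an unsigned varint; throws if the buffer ends early. */
    static long readVarint(ByteBuffer in) {
        long result = 0;
        for (int shift = 0; shift < 64; shift += 7) {
            byte b = in.get();
            result |= (long) (b & 0x7F) << shift;
            if ((b & 0x80) == 0) return result;
        }
        throw new IllegalArgumentException("varint longer than 10 bytes");
    }

    /** Standard UTF-8 bytes prefixed with their varint byte length. */
    static byte[] encode(String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        ByteArrayOutputStream out = new ByteArrayOutputStream(utf8.length + 10);
        writeVarint(out, utf8.length);
        out.write(utf8, 0, utf8.length);
        return out.toByteArray();
    }

    /** Inverse of encode for a buffer holding exactly one string. */
    static String decode(byte[] data) {
        ByteBuffer in = ByteBuffer.wrap(data);
        long length = readVarint(in);
        if (length < 0 || length > in.remaining()) {
            throw new IllegalArgumentException("length prefix exceeds available bytes");
        }
        return new String(data, in.position(), (int) length, StandardCharsets.UTF_8);
    }
}
```

With this layout any string of up to 127 UTF-8 bytes pays a single byte of
length overhead, while the prefix can still grow to cover a full 64-bit
length.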

A Java UTF-16 string has a max length of 2^31-1 code units; encoded in
UTF-8 it would have a maximum, though highly improbable, length of
3*(2^31-1) bytes, just under 2^33 (a BMP character takes at most 3 bytes
per code unit, and a surrogate pair takes 4 bytes for 2 code units).
Serializing as UTF-8 with a uint32 length therefore only guarantees room
for Java strings of up to (2^32-1)/3 or 1431655765 UTF-16 code units. This
is probably a reasonable limitation but we have the technology to do
better. ;) Since the server is Java it is reasonable to keep the max string
length we serialize consistent with Java's, therefore we need at least 33
bits of length.
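
As a sanity check on the 33-bit figure, here is a small sketch that assumes
the worst case of 3 UTF-8 bytes for every UTF-16 code unit (a surrogate
pair is 4 bytes for 2 code units, so it is cheaper per unit). The class and
method names are hypothetical:

```java
// Illustrative arithmetic only; names are hypothetical.
final class Utf8LengthBounds {
    /** Maximum length of a Java String in UTF-16 code units (2^31-1). */
    static final long MAX_JAVA_STRING_UNITS = Integer.MAX_VALUE;

    /** Worst-case UTF-8 size, assuming 3 bytes per UTF-16 code unit. */
    static long worstCaseUtf8Bytes(long utf16CodeUnits) {
        return 3L * utf16CodeUnits;
    }

    /** Number of bits needed to hold value as an unsigned integer. */
    static int bitsNeeded(long value) {
        return 64 - Long.numberOfLeadingZeros(value);
    }
}
```

For MAX_JAVA_STRING_UNITS the worst case comes to 6442450941 bytes, which
lands just under 2^33 and needs 33 bits, one more than a uint32 provides.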

For reference, a C++11 std::basic_string has a max capacity that is
platform dependent, but on 64-bit Linux with GCC it is 2^63-1. The
basic_string can hold UTF-8, UTF-16 or UTF-32.

The important part of this proposal is to convert everything to using
standard UTF-8 and deprecate all the other methods. I would ask that we
drop the other methods completely at the next major release. Not having to
implement 4 encodings in each of our clients will help development of new
clients. Not having to translate between standard and non-standard string
types will help performance and reduce coding errors. All the other string
encodings I have found should be handled in the new protocol we are working
on, which is now using standard UTF-8, and are therefore outside the scope
of this proposal and discussion.

-Jake
