The current data serializer supports 4 different string encodings (in addition to the 3 or 4 other encodings I have found elsewhere in Geode): ASCII prefixed with a uint16 length (64K), ASCII prefixed with a uint32 length, Java modified UTF-8 with a uint16 length (64K), and finally UTF-16 with a uint32 length (a count of UTF-16 code units, not bytes). When serializing a string, it is first scanned for non-ASCII characters and the number of bytes a modified UTF-8 encoding would require is calculated. If the encoded length equals the original length then the string is ASCII, and if that length is less than 2^16 the 16-bit length version is used, otherwise the 32-bit length version. If the encoded length is greater than the original but less than 2^16 then modified UTF-8 is used with a 16-bit length, otherwise UTF-16 with a 32-bit length.
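For clarity, here is a minimal sketch of that selection logic. The class, method, and encoding names are hypothetical, chosen for illustration; this is not Geode's actual API.

```java
public class StringEncodingChooser {
    // Byte count a Java modified UTF-8 encoding would need: 1 byte for
    // U+0001..U+007F, 2 bytes for U+0000 and U+0080..U+07FF, 3 bytes for
    // everything else (surrogates are encoded individually, 3 bytes each).
    static long modifiedUtf8Length(String s) {
        long length = 0;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c >= 0x0001 && c <= 0x007F) {
                length += 1;
            } else if (c <= 0x07FF) {
                length += 2;
            } else {
                length += 3;
            }
        }
        return length;
    }

    static String chooseEncoding(String s) {
        long encodedLength = modifiedUtf8Length(s);
        if (encodedLength == s.length()) { // pure ASCII (and no NULs)
            return encodedLength < (1 << 16) ? "ASCII_UINT16" : "ASCII_UINT32";
        }
        return encodedLength < (1 << 16) ? "MODIFIED_UTF8_UINT16" : "UTF16_UINT32";
    }

    public static void main(String[] args) {
        System.out.println(chooseEncoding("hello"));  // ASCII_UINT16
        System.out.println(chooseEncoding("héllo"));  // MODIFIED_UTF8_UINT16
    }
}
```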
When working with non-Java clients, dealing with modified UTF-8 requires conversion between it and standard UTF-8, as modified UTF-8 encodes the NUL character in two bytes. Only Java's DataInput/DataOutput and JNI use this encoding, and it was intended for internal use and Java serialization only; the StreamReader/Writer use standard UTF-8. Since our serialization expects modified UTF-8 strings to be prefixed with a 16-bit length, care has to be taken to calculate the modified length up front (or to seek back in the buffer and rewrite the length if NULs are encountered). Since the modified length may differ from the standard length, care must also be taken that, if the string has to be truncated to fit the 16-bit limit, it is not truncated in the middle of a multi-byte sequence.

Encoding in UTF-16 isn't all that bad, except that it is mostly wasted space when strings are ASCII. There are no real encoding issues between languages, since most support it or use it as their internal representation, like Java does. But we are talking about serialization here, and typically space is the constraint. Most Latin characters are low enough in the Basic Multilingual Plane to be encoded in 2 bytes of UTF-8 and take up no more space than the UTF-16 encoded version; other characters will take up more space. Since we took the care to optimize ASCII, one can assume we figured out that ASCII was our most common character set. Regardless of the correctness of this assertion, it makes no sense to treat ASCII and UTF-8 streams differently, as ASCII can be fully encoded byte for byte in UTF-8 without any overhead.

So what I would like to propose is that we deprecate all these methods and replace them with standard UTF-8 prefixed with a uint64 length. It is preferable that the length be variable-length encoded to reduce the overhead of encoding small strings. Why such a large length? Consider that different languages have different limits, and that Java stores strings internally as UTF-16.
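To see the incompatibility concretely: Java's DataOutputStream.writeUTF (modified UTF-8) encodes U+0000 as the two-byte sequence 0xC0 0x80, while standard UTF-8 uses a single 0x00 byte. The class name here is just for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

public class ModifiedUtf8Demo {
    // writeUTF produces a uint16 byte-length prefix followed by modified UTF-8.
    static byte[] javaModifiedUtf8(String s) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            new DataOutputStream(bytes).writeUTF(s);
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e); // cannot happen in memory
        }
    }

    public static void main(String[] args) {
        byte[] modified = javaModifiedUtf8("\u0000");
        byte[] standard = "\u0000".getBytes(StandardCharsets.UTF_8);
        // modified: 00 02 C0 80 (length prefix of 2, then the two-byte NUL)
        // standard: a single 00 byte
        System.out.printf("modified = %d bytes, standard = %d byte%n",
                modified.length, standard.length);
    }
}
```

Any client speaking standard UTF-8 has to special-case that 0xC0 0x80 sequence (an overlong encoding that strict UTF-8 decoders reject) in both directions.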
A Java UTF-16 string has a max length of 2^31−1 code units; encoded in UTF-8 it would have a maximum, though highly improbable, length of 3 × (2^31−1) bytes (up to 3 UTF-8 bytes per UTF-16 code unit), which requires 33 bits to represent. Serializing as UTF-8 with a uint32 length limits the max Java string length to 2^29−1, or 536870911, UTF-16 code points. This is probably a reasonable limitation, but we have the technology to do better. ;) Since the server is Java, it is reasonable to keep the max serialized string length consistent with Java's, and therefore we need at least 33 bits of length. For reference, a C++11 std::basic_string has a max capacity that is platform dependent, but on 64-bit Linux with GCC it is 2^63−1; a basic_string can hold UTF-8, UTF-16 or UTF-32.

The important part of this proposal is to convert everything to standard UTF-8 and to deprecate all the other methods. I would ask that we drop the deprecated methods completely at the next major release. Not having to implement 4 encodings in each of our clients will help development of new clients, and not having to translate between standard and non-standard string types will help performance and reduce coding errors. All the other string encodings I have found should be handled by the new protocol we are working on, which is now using standard UTF-8, and are therefore outside the scope of this proposal and discussion.

-Jake
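P.S. For the sake of discussion, a minimal sketch of the proposed wire format, assuming a LEB128-style varint for the uint64 length. All names here are hypothetical, not a proposed API.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

public class Utf8StringCodec {
    // 7 bits per byte, least significant group first; high bit marks "more".
    static void writeVarUint64(DataOutput out, long value) throws IOException {
        while ((value & ~0x7FL) != 0) {
            out.writeByte((int) ((value & 0x7F) | 0x80));
            value >>>= 7;
        }
        out.writeByte((int) value);
    }

    static long readVarUint64(DataInput in) throws IOException {
        long value = 0;
        for (int shift = 0; shift < 64; shift += 7) {
            byte b = in.readByte();
            value |= (long) (b & 0x7F) << shift;
            if ((b & 0x80) == 0) {
                return value;
            }
        }
        throw new IOException("varint too long");
    }

    // Standard UTF-8 bytes prefixed with a varint byte count; a short string
    // pays only 1 byte of length overhead.
    static byte[] encode(String s) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
            writeVarUint64(out, utf8.length);
            out.write(utf8);
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e); // cannot happen in memory
        }
    }

    static String decode(byte[] encoded) {
        try {
            DataInput in = new DataInputStream(new ByteArrayInputStream(encoded));
            long length = readVarUint64(in);
            byte[] utf8 = new byte[(int) length]; // demo only: assumes < 2^31
            in.readFully(utf8);
            return new String(utf8, StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] encoded = encode("héllo\u0000"); // NUL is a single 0x00 byte
        System.out.println(encoded.length);     // 8: 1 length byte + 7 UTF-8 bytes
        System.out.println(decode(encoded).equals("héllo\u0000")); // true
    }
}
```

With the varint, an empty or short string costs a single length byte, while the format still admits lengths beyond 2^32 on the wire.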