Hi -

> JSON has been targeted at the Windows/Java UTF-16 world; there is always
> going to be a mismatch if you try to represent it in UTF-8 or anything
> that doesn't have surrogate pairs.
The JSON RFC 8259 section 8.1 mandates UTF-8 encoding for situations like
ours.

> > Yes, and yet we have had the bidi situation recently, where UTF-8 raw
> > codes could visually confuse a human reader whereas escaped \uXXXX
> > wouldn't.  If we forbid \uXXXX unilaterally, we literally become
> > incompatible with JSON (RFC 8259 section 7, "Strings": "Any character
> > may be escaped."), and for what?
>
> RFC 8259 says this:
>
>    However, the ABNF in this specification allows member names and
>    string values to contain bit sequences that cannot encode Unicode
>    characters; for example, "\uDEAD" (a single unpaired UTF-16
>    surrogate).  Instances of this have been observed, for example, when
>    a library truncates a UTF-16 string without checking whether the
>    truncation split a surrogate pair.  The behavior of software that
>    receives JSON texts containing such values is unpredictable; for
>    example, implementations might return different values for the length
>    of a string value or even suffer fatal runtime exceptions.
>
> A UTF-8 environment has to enforce *some* additional constraints
> compared to the official JSON syntax.

I'm sorry, I don't see how.  If a JSON string were to include the suspect
"\uDEAD", then a producer observing our hypothetical "no escapes!" rule
could simply reencode it as the raw octets 0xED 0xBA 0xAD, which are not
valid UTF-8 either.  ISTM we're no better off.

- FChE
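P.S. FWIW, the two spellings really are equally broken, which a minimal
Python sketch can show (assuming CPython's json module, which accepts
lone-surrogate escapes; other parsers may reject them outright):

```python
import json

# RFC 8259's grammar allows the escape \uDEAD, and CPython's json module
# accepts it, yielding a one-character string holding a lone surrogate.
s = json.loads('"\\udead"')
assert len(s) == 1 and ord(s) == 0xDEAD

# The escaped form cannot survive a UTF-8 round trip: an unpaired
# surrogate is not encodable as UTF-8.
try:
    s.encode("utf-8")
    escaped_form_ok = True
except UnicodeEncodeError:
    escaped_form_ok = False

# Neither can the "raw octet" spelling 0xED 0xBA 0xAD: a strict UTF-8
# decoder rejects surrogate byte sequences just the same.
try:
    b"\xed\xba\xad".decode("utf-8")
    raw_form_ok = True
except UnicodeDecodeError:
    raw_form_ok = False

print(escaped_form_ok, raw_form_ok)  # -> False False
```

So banning the escape only moves the invalidity around; it doesn't remove
it.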