On 04/16/2013 01:16 AM, Jan Arne Petersen wrote:

> But we still need to think about how to handle invalid byte sequences
> anyways. What do we expect a toolkit to do when text with invalid byte
> sequences is inserted with commit_string? How to handle
> delete_surrounding_text with the byte offsets not matching code points?
> Should the toolkit ignore such requests or should we leave that as
> undefined behavior?

You seem to be under the impression that it is impossible to edit text unless it is converted from UTF-8 to some other form. You do know that there can be encoding errors in UTF-16 as well, right?

My recommendation is that the editor store UTF-8 and preserve error bytes. Handling of errors is a *DISPLAY* problem, not a storage problem.

Errors should show a single error glyph for each byte in the error. For instance, the sequence 0xE0,0xC0,0x20 is two error bytes followed by a space (not a single error followed by a space, or a single error, as some systems will do). The reason for this rule is that it allows bi-directional parsing of text with errors in it without looking ahead more than 4 bytes, and it matches the UTF-16 encoding I describe below.
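A minimal sketch of this per-byte rule, leaning on Python's surrogateescape error handler (which already escapes each invalid byte individually) to find the error bytes; the helper name `display_form` is mine:

```python
def display_form(data: bytes) -> str:
    """Decode UTF-8, showing one error glyph (U+FFFD) per invalid byte."""
    # surrogateescape maps each invalid byte 0xXX to the lone
    # surrogate U+DCXX, so errors stay strictly per-byte.
    s = data.decode("utf-8", "surrogateescape")
    return "".join(
        "\ufffd" if 0xDC80 <= ord(c) <= 0xDCFF else c
        for c in s
    )

# 0xE0,0xC0,0x20 -> two error glyphs followed by a space
print(display_form(b"\xe0\xc0 "))  # '\ufffd\ufffd '
```

Because every error byte maps to exactly one glyph, a renderer can walk the byte string in either direction and always agree on where the errors are.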

If you have old code that cannot handle Unicode unless it is translated to UTF-16 or UTF-32, then I recommend each error byte be turned into 0xDCxx, where xx is the error byte. This is the scheme used by Python. The nice thing is that it is somewhat possible to invert the transformation, and the result is invalid UTF-16 as well. (It is not possible to make it fully invertible unless you disallow UTF-8 encodings of these code points, which would mean you cannot store invalid UTF-16 in UTF-8; that is a much more serious problem, as Windows allows filenames containing invalid UTF-16.)
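This is what Python exposes as the "surrogateescape" error handler (PEP 383); a short illustration of the transform and its round trip:

```python
# Each invalid byte 0xXX becomes the unpaired surrogate U+DCXX,
# which is itself invalid UTF-16, and the transform round-trips.

raw = b"\xe0\xc0 "                       # two error bytes + a space
s = raw.decode("utf-8", "surrogateescape")
print([hex(ord(c)) for c in s])          # ['0xdce0', '0xdcc0', '0x20']

back = s.encode("utf-8", "surrogateescape")
assert back == raw                       # invertible for this input
```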

If text is only to be displayed, another possibility is to display each error byte (and translate it to UTF-16) by looking it up in the CP1252 character set. This will allow the vast majority of existing 8-bit-encoded text to display correctly, and thus removes most of the need to know whether text is in UTF-8. It is a little risky, however, if further processing assigns any important meaning to ISO-8859-1 characters. (Using CP1252, besides making Windows text display correctly, also hides the dangerous NEL and CSI control characters.)
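A sketch of that display fallback, again using surrogateescape to locate the error bytes; the helper name is mine, and note that five bytes (0x81, 0x8D, 0x8F, 0x90, 0x9D) are undefined even in CP1252:

```python
def decode_utf8_cp1252_fallback(data: bytes) -> str:
    """Decode UTF-8, mapping invalid bytes through CP1252 for display."""
    s = data.decode("utf-8", "surrogateescape")
    out = []
    for c in s:
        cp = ord(c)
        if 0xDC80 <= cp <= 0xDCFF:            # an escaped error byte
            try:
                out.append(bytes([cp - 0xDC00]).decode("cp1252"))
            except UnicodeDecodeError:        # 0x81,0x8D,0x8F,0x90,0x9D
                out.append("\ufffd")
        else:
            out.append(c)
    return "".join(out)

# Latin-1/CP1252 text displays correctly despite not being UTF-8:
print(decode_utf8_cp1252_fallback(b"caf\xe9"))   # 'café'
# NEL (0x85) shows as CP1252's ellipsis instead of acting as a control:
print(decode_utf8_cp1252_fallback(b"a\x85b"))    # 'a…b'
```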

You will have to transform the text and the offsets you receive in the input method events to UTF-16 text and UTF-16 offsets. However, at least both transforms are done in the same place, so even if you don't agree with the scheme proposed above, it will at least work.
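A sketch of the offset half of that transform, converting a UTF-8 byte offset into a UTF-16 code-unit offset; the function name is mine, and it assumes the byte offset falls on a character (or error-byte) boundary:

```python
def utf8_to_utf16_offset(data: bytes, byte_offset: int) -> int:
    """Map a UTF-8 byte offset to the equivalent UTF-16 code-unit offset."""
    prefix = data[:byte_offset].decode("utf-8", "surrogateescape")
    # Code points above the BMP take two UTF-16 code units (a surrogate
    # pair); everything else, including 0xDCxx error bytes, takes one.
    return sum(2 if ord(c) > 0xFFFF else 1 for c in prefix)

text = "a\U0001F600b"              # 'a', an emoji (4 UTF-8 bytes), 'b'
data = text.encode("utf-8")
print(utf8_to_utf16_offset(data, 5))   # 3: 'a' is 1 unit, emoji is 2
```

Because the 0xDCxx scheme turns each error byte into exactly one UTF-16 code unit, the same arithmetic works unchanged on text containing errors.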



_______________________________________________
wayland-devel mailing list
[email protected]
http://lists.freedesktop.org/mailman/listinfo/wayland-devel
