On 04/16/2013 01:16 AM, Jan Arne Petersen wrote:

> But we still need to think about how to handle invalid byte sequences
> anyways. What do we expect a toolkit to do when text with invalid byte
> sequences is inserted with commit_string? How to handle
> delete_surrounding_text with the byte offsets not matching code points?
> Should the toolkit ignore such requests or should we leave that as
> undefined behavior?

You seem to be under the impression that it is impossible to edit text unless it is converted from UTF-8 to some other form. You do know that there can be encoding errors in UTF-16 as well, right?

My recommendation is that the editor store UTF-8 and preserve error bytes. Handling of errors is a *DISPLAY* problem, not a storage problem.

Errors should show a single error glyph for each byte in the error. For instance, the sequence 0xE0,0xC0,0x20 is two error bytes followed by a space (not a single error followed by a space, or a single error, as some systems will do). The reason for this rule is that it allows bi-directional parsing of text with errors in it without looking ahead more than 4 bytes, and it matches the UTF-16 encoding I describe below.
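A minimal sketch of this per-byte rule, leaning on Python's surrogateescape error handler (which already escapes each invalid byte individually) to find the error bytes; the helper name `display_form` is mine:

```python
def display_form(data: bytes) -> str:
    """Decode UTF-8, showing one error glyph (U+FFFD) per invalid byte."""
    # surrogateescape maps each invalid byte 0xXX to the lone
    # surrogate U+DCXX, so errors stay strictly per-byte.
    s = data.decode("utf-8", "surrogateescape")
    return "".join(
        "\ufffd" if 0xDC80 <= ord(c) <= 0xDCFF else c
        for c in s
    )

# 0xE0,0xC0,0x20 -> two error glyphs followed by a space
print(display_form(b"\xe0\xc0 "))  # '\ufffd\ufffd '
```

Because every error byte maps to exactly one glyph, a renderer can walk the byte string in either direction and always agree on where the errors are.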

If you have old code that cannot handle Unicode unless it is translated to UTF-16 or UTF-32, then I recommend each error byte be turned into 0xDCxx, where xx is the error byte. This is the scheme used by Python. The nice thing is that it is somewhat possible to invert the transformation, and the result is invalid UTF-16 as well. (It is not possible to make it fully invertible unless you disallow UTF-8 encodings of these code points, which would mean you cannot store invalid UTF-16 in UTF-8; that is a much more serious problem, as Windows allows filenames containing invalid UTF-16.)
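This is what Python exposes as the "surrogateescape" error handler (PEP 383); a short illustration of the transform and its round trip:

```python
# Each invalid byte 0xXX becomes the unpaired surrogate U+DCXX,
# which is itself invalid UTF-16, and the transform round-trips.

raw = b"\xe0\xc0 "                       # two error bytes + a space
s = raw.decode("utf-8", "surrogateescape")
print([hex(ord(c)) for c in s])          # ['0xdce0', '0xdcc0', '0x20']

back = s.encode("utf-8", "surrogateescape")
assert back == raw                       # invertible for this input
```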

If text is only to be displayed, another possibility is to display each error byte (and translate it to UTF-16) by looking it up in the CP1252 character set. This will allow the vast majority of existing 8-bit-encoded text to display correctly, and thus removes most of the need to know whether text is in UTF-8. It is a little risky, however, if further processing assigns any important meaning to ISO-8859-1 characters. (Using CP1252, besides making Windows text display correctly, also hides the dangerous NEL and CSI control characters.)
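A sketch of that display fallback, again using surrogateescape to locate the error bytes; the helper name is mine, and note that five bytes (0x81, 0x8D, 0x8F, 0x90, 0x9D) are undefined even in CP1252:

```python
def decode_utf8_cp1252_fallback(data: bytes) -> str:
    """Decode UTF-8, mapping invalid bytes through CP1252 for display."""
    s = data.decode("utf-8", "surrogateescape")
    out = []
    for c in s:
        cp = ord(c)
        if 0xDC80 <= cp <= 0xDCFF:            # an escaped error byte
            try:
                out.append(bytes([cp - 0xDC00]).decode("cp1252"))
            except UnicodeDecodeError:        # 0x81,0x8D,0x8F,0x90,0x9D
                out.append("\ufffd")
        else:
            out.append(c)
    return "".join(out)

# Latin-1/CP1252 text displays correctly despite not being UTF-8:
print(decode_utf8_cp1252_fallback(b"caf\xe9"))   # 'café'
# NEL (0x85) shows as CP1252's ellipsis instead of acting as a control:
print(decode_utf8_cp1252_fallback(b"a\x85b"))    # 'a…b'
```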

You will have to transform the text and the offsets you receive in the input method events to UTF-16 text and UTF-16 offsets. However, at least both transforms are done in the same place, so even if you don't agree with the scheme proposed above, it will at least work.
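A sketch of the offset half of that transform, converting a UTF-8 byte offset into a UTF-16 code-unit offset; the function name is mine, and it assumes the byte offset falls on a character (or error-byte) boundary:

```python
def utf8_to_utf16_offset(data: bytes, byte_offset: int) -> int:
    """Map a UTF-8 byte offset to the equivalent UTF-16 code-unit offset."""
    prefix = data[:byte_offset].decode("utf-8", "surrogateescape")
    # Code points above the BMP take two UTF-16 code units (a surrogate
    # pair); everything else, including 0xDCxx error bytes, takes one.
    return sum(2 if ord(c) > 0xFFFF else 1 for c in prefix)

text = "a\U0001F600b"              # 'a', an emoji (4 UTF-8 bytes), 'b'
data = text.encode("utf-8")
print(utf8_to_utf16_offset(data, 5))   # 3: 'a' is 1 unit, emoji is 2
```

Because the 0xDCxx scheme turns each error byte into exactly one UTF-16 code unit, the same arithmetic works unchanged on text containing errors.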



_______________________________________________
wayland-devel mailing list
[email protected]
http://lists.freedesktop.org/mailman/listinfo/wayland-devel
