On Thu, 27 Sep 2012 11:52:26 +0700 Nathan Wells <sungk...@gmail.com> wrote:
>> 1. If you are shutting off the ICU breakiterator for text following, we
>> should probably also do it for text preceding. Thus if there is a
>> ZWSP or ZWNBSP (U+2060 WJ) anywhere in a text then ICU break
>> iteration is disabled for the whole sentence.
> Yes, I think you are right. If a ZWSP of ZWNBSP is detected then ICU
> break iteration should be disabled for the whole sentence.

What is the logic of this? The use cases I see are:

1) The user always marks word breaks with ZWSP. In this case, the ideal
is to switch off the break iterator for the language.

2) The user never marks word breaks. In this case, the user is totally
dependent on the break iterator, and cannot be helped when it fails.

3) The user only marks word breaks and non-word breaks when the
iterator fails. In this case, the iterator need only be switched off
from the point of override until it can clearly re-synch. The obvious
re-synching points are word-external punctuation, such as end-of-line,
white space, quotation marks, commas and dandas (and as dandas I would
include U+0E2F THAI CHARACTER PAIYANNOI in its role as angkhandeaw, as
well as U+17D5 KHMER SIGN BARIYOOSAN, though an exception may be
worthwhile for Thai ฯลฯ and ฯเปฯ).

Now, it may be easier to explain the rule if it applies to the whole
'word' - for what we are looking at is pretty much a 'word' as
understood by dictionariless editors.

4) Different parts of the text come from different sources - some mark
word breaks, others expect the application to correctly identify them.
A ZWSP in a chunk of text would then tag the text as having come from a
user in case 1 or 3; we have no reliable way of distinguishing the two
cases.

A WJ (U+2060) or ZWNBSP (U+FEFF) (when not a BOM, so paragraph initial
is suspect) would strongly suggest use case 3 - but might occur in use
case 1 if the user has had to fight a break iterator.
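To make the rule in use case 3 concrete, here is a rough sketch in
Python (not LibreOffice or ICU code). It splits text at the re-synching
points and marks each resulting 'word' as needing the break iterator
only if it contains no ZWSP, WJ or ZWNBSP. The names OVERRIDES,
RESYNC_CHARS and chunks() are made up for illustration, the set of
re-synching characters is only a sample, and neither the paragraph-
initial BOM case nor the ฯลฯ exception is handled.

```python
import re

ZWSP = "\u200b"    # ZERO WIDTH SPACE: explicit word break
WJ = "\u2060"      # WORD JOINER: explicit non-break
ZWNBSP = "\ufeff"  # ZERO WIDTH NO-BREAK SPACE (also used as the BOM)
OVERRIDES = {ZWSP, WJ, ZWNBSP}

# Illustrative re-synching characters (an assumption, not a complete list):
# ASCII and curly quotation marks, comma, Devanagari danda and double danda,
# U+0E2F THAI CHARACTER PAIYANNOI, U+17D5 KHMER SIGN BARIYOOSAN.
RESYNC_CHARS = "\"',\u2018\u2019\u201c\u201d\u0964\u0965\u0e2f\u17d5"

# \s covers white space and end-of-line; it does not match ZWSP, WJ or ZWNBSP.
RESYNC = re.compile("[\\s" + re.escape(RESYNC_CHARS) + "]+")


def chunks(text):
    """Yield (chunk, use_iterator) for each run between re-synching points.

    use_iterator is False when the chunk contains a user override, i.e. the
    break iterator would be switched off for that chunk only.
    """
    for chunk in RESYNC.split(text):
        if chunk:
            yield chunk, not (OVERRIDES & set(chunk))


if __name__ == "__main__":
    sample = "กขค\u200bงจ ฉชซ, ฌญ\u2060ฎ"
    for chunk, use_iterator in chunks(sample):
        print(repr(chunk), "iterator on" if use_iterator else "iterator off")
```

With this shape, an override in one chunk has no effect on the chunks
around it, which is the difference from disabling the iterator for the
whole sentence.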
(end of use cases)

Considering these four use cases, it seems simplest to let ZWSP, WJ and
ZWNBSP disable the iterator for the extent of the dictionariless word
in which it occurs.

What is the definition of an ICU sentence boundary? I see no evidence
from CLDR 2.9 that it should be even approximately right for Khmer (or
Thai). Splitting Thai text into sentences is known to be challenging -
we can therefore expect different applications to split text
differently.

The one downside I can see to my suggestion is that if all word
boundaries are marked, switching the iterator off dictionariless word
by dictionariless word will require slightly greater use of WJ, for a
ZWSP later in the sentence will not necessarily be in the same
dictionariless word.

A related issue that seems not to be handled is the repetition mark
U+0E46 THAI CHARACTER MAIYAMOK. It should be separated from the
preceding alphabetic characters by a space, but LibreOffice doesn't
recognise the sequence as a possible continuation of the word.
Sometimes it is a necessary part of a word. I don't know what the
situation is in Khmer.

Richard.
_______________________________________________
LibreOffice mailing list
LibreOffice@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/libreoffice