On Thu, 27 Sep 2012 21:08:13 +0700 Nathan Wells <sungk...@gmail.com> wrote:
> Firstly, you are right, I was mistaken about ICU and the breakiterator
> working for sentences (I just tried it right now and it does work,
> but just not with the normal "khan" or "period" of Khmer; rather, it
> works with Latin sentence markers, which is not enough).  I had
> thought when we put in the code for the breakiterator that it also
> covered the sentence, but I guess not (I will work towards getting it
> working for Khmer).

It may be worth modifying the CLDR definition - sentence breaks can be
customised, though it is presently only done for Greek.  However, if
you want Khmer *sentence* rather than *clause* breaking, it will need a
lot of work - papers are still being published on breaking Thai into
sentences (e.g. www.mt-archive.info/Coling-2010-Slayden.pdf ).

> In response to your comments:
>
> > 1) The user always marks word breaks with ZWSP.
> > In this case, the ideal is to switch off the break iterator for the
> > language.
>
> There is some truth to this - and that is why I had it as my last
> option (just turning the whole thing off).  But the ICU breakiterator
> for Khmer actually works quite well with normal language - it breaks
> down when there are proper names.  So turning it off is an option, but
> not the most ideal solution.  Some users will continue to always mark
> breaks with a ZWSP (for full control), but I also think having the
> option to turn it off for more complex sentences would be ideal.
>
> > 2) The user never marks word breaks.
> > In this case, the user is totally dependent on the break iterator,
> > and cannot be helped when it fails.
>
> As I said above, I think a both/and solution would be ideal for Khmer.
> But if in the end it would work better for Thai to have an "off" and
> "on" option only, that would be fine for Khmer as well for now, until
> we can come up with a more ideal solution.
>
> > 3) The user only marks word breaks and non-word breaks when the
> > iterator fails.
>
> The problem with this in Khmer is the user cannot tell when the
> breakiterator fails, unless it is on a line-break.  A word could be
> broken up into three parts and the user would never know it.

I usually notice iterator failures in Thai with unrecognised words,
which prompts red ink over strange extents.  Usually the words are not
recognised because they're misspelt, but not always.  The problem I see
in Thai is usually not so much extra word boundaries as misplaced word
boundaries.

> Actually, if users could see where the
> breakiterator is breaking words, that would simplify things a lot.

That is a very significant observation.

> The only problem with this would be at the beginning of a document or
> the beginning of any new "re-syncing" segment because you might run
> into something like this:
> User input (example in English so others can make sense of it I hope):
> wordwordwordwordword.
> How the sentence is broken up by the breakiterator: wo r d word word
> wo rd word.
> User adds ZWSP to fix broken word on line-break: wo r d word word
> ZWSPwordword.

This example confuses me.  The problem here seems to be extra word
breaks rather than missing word breaks, and I don't see how confirming
a word break helps.

> But user has no idea the first word is broken incorrectly and that it
> is also spelled incorrectly.
> This is why it would be best (I think) as Martin suggested that when
> a ZWSP is detected it also turn off break iteration for the previous
> words up until a re-sync point.  This would practically give the user
> an "off" option for the whole document if they so chose, and without
> the confusion of having to find some option in the Tools menu to turn
> it on or off - it would just be automatic, depending on the user's
> habit.

I was clearly not clear enough.  In the example above,
'wordwordwordwordword' is what I would call a dictionariless word - a
word-breaker without a dictionary (e.g. a shell's parser) would see it
as just one 'word'.
Therefore, once ZWSP is inserted and word-breaking disabled,
dictionary-based word-breaking is not applied to
wordwordwordZWSPwordword, and, typically, red squiggles appear under
wordwordword and wordword.  The boundary may be revealed by a phase
discontinuity or gap in the squiggle.  Under the proposed scheme, the
user has to introduce another three ZWSPs even if the dictionary
contains all the words.

> I agree with this:
>
> > Considering these four use cases, it seems simplest to let ZWSP, WJ
> > and ZWNBSP disable the iterator for the extent of the
> > dictionariless word in which it occurs.
> Except, it also should disable the breakiterator up to the previous
> re-sync point...

But that is what I meant!

> But actually, there is a rule in ICU for the MAIYAMOK
> so unless that is not working properly, I am not sure why LibreOffice
> doesn't break correctly...

I'll have to look further into this - and check that the misbehaviour
is still happening.  Squiggly lines are what I chiefly remember.  There
may also be a Hunspell issue - the entries in the dictionary don't have
spaces before maiyamok.  The difference between finding word boundaries
and finding line boundaries may be significant here.

Richard.
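
P.S. To make the proposed rule concrete, here is a rough sketch in Python. It is not ICU or LibreOffice code, and every name in it is made up for illustration: break_words and dictionary_break are hypothetical, and the greedy longest-match breaker merely stands in for a real dictionary-based iterator. A 'dictionariless word' is approximated here as a run of characters containing no whitespace or Latin punctuation.

```python
import re

ZWSP, WJ, ZWNBSP = "\u200b", "\u2060", "\ufeff"
MARKS = (ZWSP, WJ, ZWNBSP)


def dictionary_break(chunk, lexicon):
    """Toy stand-in for a dictionary-based breaker: greedy longest match."""
    out, i = [], 0
    while i < len(chunk):
        for j in range(len(chunk), i, -1):
            if chunk[i:j] in lexicon:
                break
        else:
            j = i + 1  # nothing matched: emit a single character
        out.append(chunk[i:j])
        i = j
    return out


def break_words(text, lexicon):
    """Apply the rule: a ZWSP, WJ or ZWNBSP inside a dictionariless word
    switches the dictionary breaker off for that whole word."""
    words = []
    # Assumption: a dictionariless word is a run without spaces/punctuation.
    for chunk in re.findall(r"[^\s.,;:!?]+", text):
        if any(mark in chunk for mark in MARKS):
            # The user has taken control: break only at ZWSP, and treat
            # WJ / ZWNBSP as explicit no-break marks.
            for part in chunk.split(ZWSP):
                part = part.replace(WJ, "").replace(ZWNBSP, "")
                if part:
                    words.append(part)
        else:
            words.extend(dictionary_break(chunk, lexicon))
    return words
```

With a lexicon containing only "word", break_words("wordwordword.", {"word"}) gives three words, whereas inserting one ZWSP, as in "wordwordword\u200bwordword.", gives just "wordwordword" and "wordword" - which is exactly why the user then has to add the remaining ZWSPs by hand.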