Bug#692241: libexttextcat: botched encoding in Polish LM

Caolán McNamara Mon, 05 Nov 2012 06:15:24 -0800

Thanks, fixed upstream now as
http://cgit.freedesktop.org/libreoffice/libexttextcat/commit/?id=07ad459ca83bddde8dcfad5535b3260386d222ff


re "I wonder if the language models shouldn't be somehow automatically
rebuilt from the ShortTexts/*.txt files. (Encoding of pl.txt appears to
be correct.)"

Not all of the .LMs can be generated from the short texts, e.g. some of
the very similar languages need a bit of extra training and tweaking to
distinguish them from each other. The short texts are used in the test
suite.
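
For readers unfamiliar with how a textcat-style language model relates to
its training text, here is a minimal sketch of the general n-gram ranking
idea (Cavnar and Trenkle style). It is not the tool used to build the
shipped .LM files; the function names, the 1..5 n-gram range, the 400-entry
cutoff and the tab-separated output layout are all assumptions made for
illustration.

```python
import re
from collections import Counter


def build_profile(text, max_n=5, top=400):
    """Rank character n-grams (lengths 1..max_n) by frequency.

    Hypothetical helper: max_n and top are assumed defaults, not values
    taken from libexttextcat.
    """
    counts = Counter()
    # Take runs of Unicode letters as words and pad each with '_' so that
    # word-initial and word-final n-grams are distinguishable.
    for word in re.findall(r"[^\W\d_]+", text.lower()):
        padded = "_" + word + "_"
        for n in range(1, max_n + 1):
            for i in range(len(padded) - n + 1):
                gram = padded[i:i + n]
                if gram != "_":  # a bare padding character carries no information
                    counts[gram] += 1
    # Most frequent first; ties broken alphabetically for stable output.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:top]


def write_lm(profile, path):
    """Write one 'ngram<TAB>count' line per entry (assumed layout).

    The encoding is stated explicitly so the result does not depend on
    the locale of the machine that generates the file.
    """
    with open(path, "w", encoding="utf-8") as f:
        for gram, count in profile:
            f.write(f"{gram}\t{count}\n")
```

Something like `write_lm(build_profile(open("pl.txt", encoding="utf-8").read()), "pl.lm")`
would produce a profile from a single short text. A profile built this way
only reflects that one text, which is why closely related languages can
need more training data than this.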

FWIW, nearly all the new language models that *I* added are derived from
those ShortTexts (see the README), but lots of the older ones, including
the Polish one, came from libtextcat originally and were trained on
unknown data. Generally I've assumed that the preexisting ones are of
better quality than models trained only on the text of the UDHR, and
left them alone.

C.

